* [PATCH v3 cgroup/for-7.0-fixes] cgroup: Fix cgroup_drain_dying() testing the wrong condition
@ 2026-03-25 17:23 Tejun Heo
2026-03-25 17:30 ` Tejun Heo
2026-03-25 18:06 ` Sebastian Andrzej Siewior
0 siblings, 2 replies; 6+ messages in thread
From: Tejun Heo @ 2026-03-25 17:23 UTC (permalink / raw)
To: cgroups, linux-kernel
Cc: Bert Karwatzki, Michal Koutny, Johannes Weiner,
Sebastian Andrzej Siewior, kernel test robot
cgroup_drain_dying() was using cgroup_is_populated() to test whether there are
dying tasks to wait for. cgroup_is_populated() tests nr_populated_csets,
nr_populated_domain_children and nr_populated_threaded_children, but
cgroup_drain_dying() only needs to care about this cgroup's own tasks - whether
there are children is cgroup_destroy_locked()'s concern.
This caused hangs during shutdown. When systemd tried to rmdir a cgroup that had
no direct tasks but had a populated child, cgroup_drain_dying() would enter its
wait loop because cgroup_is_populated() was true from
nr_populated_domain_children. The task iterator found nothing to wait for, yet
the populated state never cleared because it was driven by live tasks in the
child cgroup.
Fix it by using cgroup_has_tasks() which only tests nr_populated_csets.
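For illustration, the difference between the two checks can be sketched in userspace C. The struct and field names mirror the kernel's counters, but this is a simplified model, not the kernel implementation:

```c
#include <stdbool.h>

/* Simplified model of the populated counters kept per cgroup;
 * field names mirror the kernel's for illustration only. */
struct cgroup_counts {
	int nr_populated_csets;             /* this cgroup's own live tasks */
	int nr_populated_domain_children;   /* populated domain children */
	int nr_populated_threaded_children; /* populated threaded children */
};

/* Analog of cgroup_is_populated(): true if the cgroup or any
 * descendant holds tasks. */
static bool is_populated(const struct cgroup_counts *c)
{
	return c->nr_populated_csets + c->nr_populated_domain_children +
	       c->nr_populated_threaded_children;
}

/* Analog of cgroup_has_tasks(): true only for this cgroup's own
 * tasks, which is all cgroup_drain_dying() needs to wait out. */
static bool has_tasks(const struct cgroup_counts *c)
{
	return c->nr_populated_csets;
}
```

In the hang scenario (no direct tasks, one populated child, i.e. counts of {0, 1, 0}), is_populated() is true so the old check kept waiting, while has_tasks() is false so the drain correctly returns immediately.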
v3: Fix cgroup_is_populated() -> cgroup_has_tasks() (Sebastian).
v2: https://lore.kernel.org/r/20260323200205.1063629-1-tj@kernel.org
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Fixes: 1b164b876c36 ("cgroup: Wait for dying tasks to leave on rmdir")
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/cgroup/cgroup.c | 42 ++++++++++++++++++++++--------------------
1 file changed, 22 insertions(+), 20 deletions(-)
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6229,20 +6229,22 @@ static int cgroup_destroy_locked(struct
* cgroup_drain_dying - wait for dying tasks to leave before rmdir
* @cgrp: the cgroup being removed
*
- * The PF_EXITING filter in css_task_iter_advance() hides exiting tasks from
- * cgroup.procs so that userspace (e.g. systemd) doesn't see tasks that have
- * already been reaped via waitpid(). However, the populated counter
- * (nr_populated_csets) is only decremented when the task later passes through
+ * cgroup.procs and cgroup.threads use css_task_iter which filters out
+ * PF_EXITING tasks so that userspace doesn't see tasks that have already been
+ * reaped via waitpid(). However, cgroup_has_tasks() - which tests whether the
+ * cgroup has non-empty css_sets - is only updated when dying tasks pass through
* cgroup_task_dead() in finish_task_switch(). This creates a window where
- * cgroup.procs appears empty but cgroup_is_populated() is still true, causing
- * rmdir to fail with -EBUSY.
+ * cgroup.procs reads empty but cgroup_has_tasks() is still true, making rmdir
+ * fail with -EBUSY from cgroup_destroy_locked() even though userspace sees no
+ * tasks.
*
- * This function bridges that gap. If the cgroup is populated but all remaining
- * tasks have PF_EXITING set, we wait for cgroup_task_dead() to process them.
- * Tasks are removed from the cgroup's css_set in cgroup_task_dead() called from
- * finish_task_switch(). As the window between PF_EXITING and cgroup_task_dead()
- * is short, the number of PF_EXITING tasks on the list is small and the wait
- * is brief.
+ * This function aligns cgroup_has_tasks() with what userspace can observe. If
+ * cgroup_has_tasks() but the task iterator sees nothing (all remaining tasks are
+ * PF_EXITING), we wait for cgroup_task_dead() to finish processing them. As the
+ * window between PF_EXITING and cgroup_task_dead() is short, the wait is brief.
+ *
+ * This function only concerns itself with this cgroup's own dying tasks.
+ * Whether the cgroup has children is cgroup_destroy_locked()'s problem.
*
* Each cgroup_task_dead() kicks the waitqueue via cset->cgrp_links, and we
* retry the full check from scratch.
@@ -6258,7 +6260,7 @@ static int cgroup_drain_dying(struct cgr
lockdep_assert_held(&cgroup_mutex);
retry:
- if (!cgroup_is_populated(cgrp))
+ if (!cgroup_has_tasks(cgrp))
return 0;
/* Same iterator as cgroup.threads - if any task is visible, it's busy */
@@ -6273,15 +6275,15 @@ retry:
* All remaining tasks are PF_EXITING and will pass through
* cgroup_task_dead() shortly. Wait for a kick and retry.
*
- * cgroup_is_populated() can't transition from false to true while
- * we're holding cgroup_mutex, but the true to false transition
- * happens under css_set_lock (via cgroup_task_dead()). We must
- * retest and prepare_to_wait() under css_set_lock. Otherwise, the
- * transition can happen between our first test and
- * prepare_to_wait(), and we sleep with no one to wake us.
+ * cgroup_has_tasks() can't transition from false to true while we're
+ * holding cgroup_mutex, but the true to false transition happens
+ * under css_set_lock (via cgroup_task_dead()). We must retest and
+ * prepare_to_wait() under css_set_lock. Otherwise, the transition
+ * can happen between our first test and prepare_to_wait(), and we
+ * sleep with no one to wake us.
*/
spin_lock_irq(&css_set_lock);
- if (!cgroup_is_populated(cgrp)) {
+ if (!cgroup_has_tasks(cgrp)) {
spin_unlock_irq(&css_set_lock);
return 0;
}
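The locking rule in the comment above - retest the condition under the same lock that flips it, or risk sleeping with no one to wake us - has a direct userspace analog in the condition-variable pattern. A hedged sketch, with pthreads standing in for the waitqueue and css_set_lock (none of this is kernel code):

```c
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

/* Userspace analog of the drain loop's lost-wakeup rule: the condition
 * flips true->false only under `set_lock` (as cgroup_has_tasks() flips
 * under css_set_lock), so the waiter must retest it under that same
 * lock before sleeping. pthread_cond_wait() makes the prepare-to-wait
 * and the sleep atomic with respect to the mutex. */
static pthread_mutex_t set_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t kicked = PTHREAD_COND_INITIALIZER;
static bool tasks_remaining = true;

/* Plays cgroup_task_dead(): clears the condition and kicks waiters. */
static void *task_dead(void *arg)
{
	(void)arg;
	usleep(10000);		/* the short PF_EXITING window */
	pthread_mutex_lock(&set_lock);
	tasks_remaining = false;
	pthread_cond_broadcast(&kicked);
	pthread_mutex_unlock(&set_lock);
	return NULL;
}

/* Plays cgroup_drain_dying(): waits until the condition clears.
 * Testing tasks_remaining outside the lock and then sleeping would
 * open exactly the race the kernel comment describes. */
static int drain(void)
{
	pthread_mutex_lock(&set_lock);
	while (tasks_remaining)
		pthread_cond_wait(&kicked, &set_lock);
	pthread_mutex_unlock(&set_lock);
	return 0;
}
```

A driver would spawn task_dead() on a thread and call drain(); the while-loop retest also covers spurious wakeups, matching the patch's "retry the full check from scratch".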
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v3 cgroup/for-7.0-fixes] cgroup: Fix cgroup_drain_dying() testing the wrong condition
2026-03-25 17:23 Tejun Heo
@ 2026-03-25 17:30 ` Tejun Heo
2026-03-25 18:06 ` Sebastian Andrzej Siewior
1 sibling, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2026-03-25 17:30 UTC (permalink / raw)
To: cgroups, linux-kernel
Cc: Bert Karwatzki, Michal Koutny, Johannes Weiner,
Sebastian Andrzej Siewior, kernel test robot
Please ignore the v3 tag in the subject. This is a new patch on top of the
already applied v2, not a new version.
--
tejun
* Re: [PATCH v3 cgroup/for-7.0-fixes] cgroup: Fix cgroup_drain_dying() testing the wrong condition
2026-03-25 17:23 Tejun Heo
2026-03-25 17:30 ` Tejun Heo
@ 2026-03-25 18:06 ` Sebastian Andrzej Siewior
2026-03-26 0:02 ` Tejun Heo
1 sibling, 1 reply; 6+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-25 18:06 UTC (permalink / raw)
To: Tejun Heo
Cc: cgroups, linux-kernel, Bert Karwatzki, Michal Koutny,
Johannes Weiner, kernel test robot
On 2026-03-25 07:23:48 [-1000], Tejun Heo wrote:
> cgroup_drain_dying() was using cgroup_is_populated() to test whether there are
> dying tasks to wait for. cgroup_is_populated() tests nr_populated_csets,
> nr_populated_domain_children and nr_populated_threaded_children, but
> cgroup_drain_dying() only needs to care about this cgroup's own tasks - whether
> there are children is cgroup_destroy_locked()'s concern.
…
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
The only issue I see is if I delay the irq_work callback by a second.
Other than that, I don't see any problems.
Sebastian
* Re: [PATCH v3 cgroup/for-7.0-fixes] cgroup: Fix cgroup_drain_dying() testing the wrong condition
2026-03-25 18:06 ` Sebastian Andrzej Siewior
@ 2026-03-26 0:02 ` Tejun Heo
2026-03-26 7:35 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 6+ messages in thread
From: Tejun Heo @ 2026-03-26 0:02 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: cgroups, linux-kernel, Bert Karwatzki, Michal Koutny,
Johannes Weiner, kernel test robot
On Wed, Mar 25, 2026 at 07:06:23PM +0100, Sebastian Andrzej Siewior wrote:
> On 2026-03-25 07:23:48 [-1000], Tejun Heo wrote:
> > cgroup_drain_dying() was using cgroup_is_populated() to test whether there are
> > dying tasks to wait for. cgroup_is_populated() tests nr_populated_csets,
> > nr_populated_domain_children and nr_populated_threaded_children, but
> > cgroup_drain_dying() only needs to care about this cgroup's own tasks - whether
> > there are children is cgroup_destroy_locked()'s concern.
> …
>
> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>
> The only issue I see is if I delay the irq_work callback by a second.
> Other than that, I don't see any problems.
What issue do you see when delaying it by a second? Just things being slowed
down?
Thanks.
--
tejun
* Re: [PATCH v3 cgroup/for-7.0-fixes] cgroup: Fix cgroup_drain_dying() testing the wrong condition
[not found] <20260325172348.1836430-1-tj@kernel.org>
@ 2026-03-26 0:09 ` Tejun Heo
0 siblings, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2026-03-26 0:09 UTC (permalink / raw)
To: cgroups, linux-kernel
Cc: Sebastian Andrzej Siewior, Michal Koutny, Johannes Weiner
Applied to cgroup/for-7.0-fixes.
Thanks.
--
tejun
* Re: [PATCH v3 cgroup/for-7.0-fixes] cgroup: Fix cgroup_drain_dying() testing the wrong condition
2026-03-26 0:02 ` Tejun Heo
@ 2026-03-26 7:35 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 6+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-26 7:35 UTC (permalink / raw)
To: Tejun Heo
Cc: cgroups, linux-kernel, Bert Karwatzki, Michal Koutny,
Johannes Weiner, kernel test robot
On 2026-03-25 14:02:05 [-1000], Tejun Heo wrote:
> > The only issue I see is if I delay the irq_work callback by a second.
> > Other than that, I don't see any problems.
>
> What issue do you see when delaying it by a second? Just things being slowed
> down?
This is during boot:
[ OK ] Mounted sys-kernel-debug.mount - Kernel Debug File System.
[ OK ] Mounted sys-kernel-tracing.mount - Kernel Trace File System.
[ OK ] Mounted tmp.mount - Temporary Directory /tmp.
[ 20.845878] INFO: task systemd:1 blocked for more than 10 seconds.
[ 20.845885] Not tainted 7.0.0-rc5+ #178
[ 20.845887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 20.845888] task:systemd state:D stack:0 pid:1 tgid:1 ppid:0 task_flags:0x400100 flags:0x00080000
[ 20.845906] Call Trace:
[ 20.845911] <TASK>
[ 20.845915] __schedule+0x3db/0xf90
[ 20.845947] schedule+0x27/0xd0
[ 20.845950] cgroup_drain_dying+0x9b/0x190
[ 20.845971] cgroup_rmdir+0x2d/0x100
[ 20.845980] kernfs_iop_rmdir+0x6a/0xd0
[ 20.845993] vfs_rmdir+0x11a/0x280
[ 20.846002] filename_rmdir+0x16f/0x1e0
[ 20.846009] __x64_sys_rmdir+0x28/0x40
[ 20.846015] do_syscall_64+0x119/0x5a0
[ 20.846152] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 20.846158] RIP: 0033:0x7ff495627337
[ 20.846164] RSP: 002b:00007ffd7efa66f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000054
[ 20.846170] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff495627337
[ 20.846172] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00005646ca9583a0
[ 20.846173] RBP: 00005646ca9583a0 R08: 000000000000000c R09: 0000000000000000
[ 20.846174] R10: 0000000000000000 R11: 0000000000000246 R12: 00005646ca957ac0
[ 20.846175] R13: 0000000000000001 R14: 0000000000000004 R15: 0000000000000000
[ 20.846178] </TASK>
It does not recover, so I think there might be another race lurking. This is the change I'm talking about:
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -7112,9 +7112,9 @@ static void do_cgroup_task_dead(struct task_struct *tsk)
* irq_work to allow batching while ensuring timely completion.
*/
static DEFINE_PER_CPU(struct llist_head, cgrp_dead_tasks);
-static DEFINE_PER_CPU(struct irq_work, cgrp_dead_tasks_iwork);
+static DEFINE_PER_CPU(struct delayed_work, cgrp_delayed_tasks_iwork);
-static void cgrp_dead_tasks_iwork_fn(struct irq_work *iwork)
+static void cgrp_dead_tasks_iwork_fn(struct work_struct *iwork)
{
struct llist_node *lnode;
struct task_struct *task, *next;
@@ -7131,9 +7131,11 @@ static void __init cgroup_rt_init(void)
int cpu;
for_each_possible_cpu(cpu) {
+ struct delayed_work *dwork;
+
init_llist_head(per_cpu_ptr(&cgrp_dead_tasks, cpu));
- per_cpu(cgrp_dead_tasks_iwork, cpu) =
- IRQ_WORK_INIT_LAZY(cgrp_dead_tasks_iwork_fn);
+ dwork = &per_cpu(cgrp_delayed_tasks_iwork, cpu);
+ INIT_DELAYED_WORK(dwork, cgrp_dead_tasks_iwork_fn);
}
}
@@ -7141,7 +7143,7 @@ void cgroup_task_dead(struct task_struct *task)
{
get_task_struct(task);
llist_add(&task->cg_dead_lnode, this_cpu_ptr(&cgrp_dead_tasks));
- irq_work_queue(this_cpu_ptr(&cgrp_dead_tasks_iwork));
+ schedule_delayed_work(this_cpu_ptr(&cgrp_delayed_tasks_iwork), HZ);
}
#else /* CONFIG_PREEMPT_RT */
static void __init cgroup_rt_init(void) {}
> Thanks.
>
Sebastian
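For context on what the reproducer above delays: cgroup_task_dead() pushes the task onto a per-CPU llist and queues deferred work to drain it in a batch. That lock-free push/take-all shape can be sketched in userspace C (an illustrative analog using C11 atomics, not the kernel's llist API):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Userspace analog of the cgrp_dead_tasks per-CPU llist: producers
 * prepend lock-free (llist_add()), the deferred worker detaches the
 * whole batch at once (llist_del_all()) and walks it. */
struct dead_task {
	struct dead_task *next;
	int pid;
};

static _Atomic(struct dead_task *) dead_list = NULL;

/* llist_add() analog: lock-free prepend via compare-and-swap. */
static void push_dead(struct dead_task *t)
{
	struct dead_task *old = atomic_load(&dead_list);
	do {
		t->next = old;
	} while (!atomic_compare_exchange_weak(&dead_list, &old, t));
}

/* llist_del_all() analog: detach the entire batch in one exchange. */
static struct dead_task *take_all_dead(void)
{
	return atomic_exchange(&dead_list, NULL);
}
```

Delaying the worker, as the delayed_work reproducer does, leaves entries queued here for up to the delay - which is exactly the window during which rmdir sits in cgroup_drain_dying().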