public inbox for linux-kernel@vger.kernel.org
* [PATCH v3 cgroup/for-7.0-fixes] cgroup: Fix cgroup_drain_dying() testing the wrong condition
@ 2026-03-25 17:23 Tejun Heo
  2026-03-25 17:30 ` Tejun Heo
  2026-03-25 18:06 ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 6+ messages in thread
From: Tejun Heo @ 2026-03-25 17:23 UTC (permalink / raw)
  To: cgroups, linux-kernel
  Cc: Bert Karwatzki, Michal Koutny, Johannes Weiner,
	Sebastian Andrzej Siewior, kernel test robot

cgroup_drain_dying() was using cgroup_is_populated() to test whether there are
dying tasks to wait for. cgroup_is_populated() tests nr_populated_csets,
nr_populated_domain_children and nr_populated_threaded_children, but
cgroup_drain_dying() only needs to care about this cgroup's own tasks - whether
there are children is cgroup_destroy_locked()'s concern.

This caused hangs during shutdown. When systemd tried to rmdir a cgroup that had
no direct tasks but had a populated child, cgroup_drain_dying() would enter its
wait loop because cgroup_is_populated() was true from
nr_populated_domain_children. The task iterator found nothing to wait for, yet
the populated state never cleared because it was driven by live tasks in the
child cgroup.

Fix it by using cgroup_has_tasks() which only tests nr_populated_csets.

v3: Fix cgroup_is_populated() -> cgroup_has_tasks() (Sebastian).

v2: https://lore.kernel.org/r/20260323200205.1063629-1-tj@kernel.org

Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Fixes: 1b164b876c36 ("cgroup: Wait for dying tasks to leave on rmdir")
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup/cgroup.c |   42 ++++++++++++++++++++++--------------------
 1 file changed, 22 insertions(+), 20 deletions(-)

--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6229,20 +6229,22 @@ static int cgroup_destroy_locked(struct
  * cgroup_drain_dying - wait for dying tasks to leave before rmdir
  * @cgrp: the cgroup being removed
  *
- * The PF_EXITING filter in css_task_iter_advance() hides exiting tasks from
- * cgroup.procs so that userspace (e.g. systemd) doesn't see tasks that have
- * already been reaped via waitpid(). However, the populated counter
- * (nr_populated_csets) is only decremented when the task later passes through
+ * cgroup.procs and cgroup.threads use css_task_iter which filters out
+ * PF_EXITING tasks so that userspace doesn't see tasks that have already been
+ * reaped via waitpid(). However, cgroup_has_tasks() - which tests whether the
+ * cgroup has non-empty css_sets - is only updated when dying tasks pass through
  * cgroup_task_dead() in finish_task_switch(). This creates a window where
- * cgroup.procs appears empty but cgroup_is_populated() is still true, causing
- * rmdir to fail with -EBUSY.
+ * cgroup.procs reads empty but cgroup_has_tasks() is still true, making rmdir
+ * fail with -EBUSY from cgroup_destroy_locked() even though userspace sees no
+ * tasks.
  *
- * This function bridges that gap. If the cgroup is populated but all remaining
- * tasks have PF_EXITING set, we wait for cgroup_task_dead() to process them.
- * Tasks are removed from the cgroup's css_set in cgroup_task_dead() called from
- * finish_task_switch(). As the window between PF_EXITING and cgroup_task_dead()
- * is short, the number of PF_EXITING tasks on the list is small and the wait
- * is brief.
+ * This function aligns cgroup_has_tasks() with what userspace can observe. If
+ * cgroup_has_tasks() but the task iterator sees nothing (all remaining tasks are
+ * PF_EXITING), we wait for cgroup_task_dead() to finish processing them. As the
+ * window between PF_EXITING and cgroup_task_dead() is short, the wait is brief.
+ *
+ * This function only concerns itself with this cgroup's own dying tasks.
+ * Whether the cgroup has children is cgroup_destroy_locked()'s problem.
  *
  * Each cgroup_task_dead() kicks the waitqueue via cset->cgrp_links, and we
  * retry the full check from scratch.
@@ -6258,7 +6260,7 @@ static int cgroup_drain_dying(struct cgr

 	lockdep_assert_held(&cgroup_mutex);
 retry:
-	if (!cgroup_is_populated(cgrp))
+	if (!cgroup_has_tasks(cgrp))
 		return 0;

 	/* Same iterator as cgroup.threads - if any task is visible, it's busy */
@@ -6273,15 +6275,15 @@ retry:
 	 * All remaining tasks are PF_EXITING and will pass through
 	 * cgroup_task_dead() shortly. Wait for a kick and retry.
 	 *
-	 * cgroup_is_populated() can't transition from false to true while
-	 * we're holding cgroup_mutex, but the true to false transition
-	 * happens under css_set_lock (via cgroup_task_dead()). We must
-	 * retest and prepare_to_wait() under css_set_lock. Otherwise, the
-	 * transition can happen between our first test and
-	 * prepare_to_wait(), and we sleep with no one to wake us.
+	 * cgroup_has_tasks() can't transition from false to true while we're
+	 * holding cgroup_mutex, but the true to false transition happens
+	 * under css_set_lock (via cgroup_task_dead()). We must retest and
+	 * prepare_to_wait() under css_set_lock. Otherwise, the transition
+	 * can happen between our first test and prepare_to_wait(), and we
+	 * sleep with no one to wake us.
 	 */
 	spin_lock_irq(&css_set_lock);
-	if (!cgroup_is_populated(cgrp)) {
+	if (!cgroup_has_tasks(cgrp)) {
 		spin_unlock_irq(&css_set_lock);
 		return 0;
 	}

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3 cgroup/for-7.0-fixes] cgroup: Fix cgroup_drain_dying() testing the wrong condition
  2026-03-25 17:23 Tejun Heo
@ 2026-03-25 17:30 ` Tejun Heo
  2026-03-25 18:06 ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2026-03-25 17:30 UTC (permalink / raw)
  To: cgroups, linux-kernel
  Cc: Bert Karwatzki, Michal Koutny, Johannes Weiner,
	Sebastian Andrzej Siewior, kernel test robot

Please ignore the v3 tag in the subject. This is a new patch on top of the
already applied v2, not a new version.

--
tejun


* Re: [PATCH v3 cgroup/for-7.0-fixes] cgroup: Fix cgroup_drain_dying() testing the wrong condition
  2026-03-25 17:23 Tejun Heo
  2026-03-25 17:30 ` Tejun Heo
@ 2026-03-25 18:06 ` Sebastian Andrzej Siewior
  2026-03-26  0:02   ` Tejun Heo
  1 sibling, 1 reply; 6+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-25 18:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, linux-kernel, Bert Karwatzki, Michal Koutny,
	Johannes Weiner, kernel test robot

On 2026-03-25 07:23:48 [-1000], Tejun Heo wrote:
> cgroup_drain_dying() was using cgroup_is_populated() to test whether there are
> dying tasks to wait for. cgroup_is_populated() tests nr_populated_csets,
> nr_populated_domain_children and nr_populated_threaded_children, but
> cgroup_drain_dying() only needs to care about this cgroup's own tasks - whether
> there are children is cgroup_destroy_locked()'s concern.
…

Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

The only issue I see is when I delay the irq_work callback by a second.
Other than that, I don't see any problems.

Sebastian


* Re: [PATCH v3 cgroup/for-7.0-fixes] cgroup: Fix cgroup_drain_dying() testing the wrong condition
  2026-03-25 18:06 ` Sebastian Andrzej Siewior
@ 2026-03-26  0:02   ` Tejun Heo
  2026-03-26  7:35     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 6+ messages in thread
From: Tejun Heo @ 2026-03-26  0:02 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: cgroups, linux-kernel, Bert Karwatzki, Michal Koutny,
	Johannes Weiner, kernel test robot

On Wed, Mar 25, 2026 at 07:06:23PM +0100, Sebastian Andrzej Siewior wrote:
> On 2026-03-25 07:23:48 [-1000], Tejun Heo wrote:
> > cgroup_drain_dying() was using cgroup_is_populated() to test whether there are
> > dying tasks to wait for. cgroup_is_populated() tests nr_populated_csets,
> > nr_populated_domain_children and nr_populated_threaded_children, but
> > cgroup_drain_dying() only needs to care about this cgroup's own tasks - whether
> > there are children is cgroup_destroy_locked()'s concern.
> …
> 
> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> 
> The only issue I see is if I delay the irq_work callback by a second.
> Other than that, I don't see any problems.

What issue do you see when delaying it by a second? Just things being slowed
down?

Thanks.

-- 
tejun


* Re: [PATCH v3 cgroup/for-7.0-fixes] cgroup: Fix cgroup_drain_dying() testing the wrong condition
       [not found] <20260325172348.1836430-1-tj@kernel.org>
@ 2026-03-26  0:09 ` Tejun Heo
  0 siblings, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2026-03-26  0:09 UTC (permalink / raw)
  To: cgroups, linux-kernel
  Cc: Sebastian Andrzej Siewior, Michal Koutny, Johannes Weiner

Applied to cgroup/for-7.0-fixes.

Thanks.

-- 
tejun


* Re: [PATCH v3 cgroup/for-7.0-fixes] cgroup: Fix cgroup_drain_dying() testing the wrong condition
  2026-03-26  0:02   ` Tejun Heo
@ 2026-03-26  7:35     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 6+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-26  7:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, linux-kernel, Bert Karwatzki, Michal Koutny,
	Johannes Weiner, kernel test robot

On 2026-03-25 14:02:05 [-1000], Tejun Heo wrote:
> > The only issue I see is if I delay the irq_work callback by a second.
> > Other than that, I don't see any problems.
> 
> What issue do you see when delaying it by a second? Just things being slowed
> down?

This is during boot:

[  OK  ] Mounted sys-kernel-debug.mount - Kernel Debug File System.
[  OK  ] Mounted sys-kernel-tracing.mount - Kernel Trace File System.
[  OK  ] Mounted tmp.mount - Temporary Directory /tmp.
[   20.845878] INFO: task systemd:1 blocked for more than 10 seconds.
[   20.845885]       Not tainted 7.0.0-rc5+ #178
[   20.845887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[   20.845888] task:systemd         state:D stack:0     pid:1     tgid:1     ppid:0      task_flags:0x400100 flags:0x00080000
[   20.845906] Call Trace:
[   20.845911]  <TASK>
[   20.845915]  __schedule+0x3db/0xf90
[   20.845947]  schedule+0x27/0xd0
[   20.845950]  cgroup_drain_dying+0x9b/0x190
[   20.845971]  cgroup_rmdir+0x2d/0x100
[   20.845980]  kernfs_iop_rmdir+0x6a/0xd0
[   20.845993]  vfs_rmdir+0x11a/0x280
[   20.846002]  filename_rmdir+0x16f/0x1e0
[   20.846009]  __x64_sys_rmdir+0x28/0x40
[   20.846015]  do_syscall_64+0x119/0x5a0
[   20.846152]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   20.846158] RIP: 0033:0x7ff495627337
[   20.846164] RSP: 002b:00007ffd7efa66f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000054
[   20.846170] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff495627337
[   20.846172] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00005646ca9583a0
[   20.846173] RBP: 00005646ca9583a0 R08: 000000000000000c R09: 0000000000000000
[   20.846174] R10: 0000000000000000 R11: 0000000000000246 R12: 00005646ca957ac0
[   20.846175] R13: 0000000000000001 R14: 0000000000000004 R15: 0000000000000000
[   20.846178]  </TASK>

It does not recover, so I think there might be another race lurking.
This is the change I was referring to:

--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -7112,9 +7112,9 @@ static void do_cgroup_task_dead(struct task_struct *tsk)
  * irq_work to allow batching while ensuring timely completion.
  */
 static DEFINE_PER_CPU(struct llist_head, cgrp_dead_tasks);
-static DEFINE_PER_CPU(struct irq_work, cgrp_dead_tasks_iwork);
+static DEFINE_PER_CPU(struct delayed_work, cgrp_delayed_tasks_iwork);
 
-static void cgrp_dead_tasks_iwork_fn(struct irq_work *iwork)
+static void cgrp_dead_tasks_iwork_fn(struct work_struct *iwork)
 {
 	struct llist_node *lnode;
 	struct task_struct *task, *next;
@@ -7131,9 +7131,11 @@ static void __init cgroup_rt_init(void)
 	int cpu;
 
 	for_each_possible_cpu(cpu) {
+		struct delayed_work *dwork;
+
 		init_llist_head(per_cpu_ptr(&cgrp_dead_tasks, cpu));
-		per_cpu(cgrp_dead_tasks_iwork, cpu) =
-			IRQ_WORK_INIT_LAZY(cgrp_dead_tasks_iwork_fn);
+		dwork = &per_cpu(cgrp_delayed_tasks_iwork, cpu);
+		INIT_DELAYED_WORK(dwork, cgrp_dead_tasks_iwork_fn);
 	}
 }
 
@@ -7141,7 +7143,7 @@ void cgroup_task_dead(struct task_struct *task)
 {
 	get_task_struct(task);
 	llist_add(&task->cg_dead_lnode, this_cpu_ptr(&cgrp_dead_tasks));
-	irq_work_queue(this_cpu_ptr(&cgrp_dead_tasks_iwork));
+	schedule_delayed_work(this_cpu_ptr(&cgrp_delayed_tasks_iwork), HZ);
 }
 #else	/* CONFIG_PREEMPT_RT */
 static void __init cgroup_rt_init(void) {}

> Thanks.
> 

Sebastian


end of thread, other threads:[~2026-03-26  7:35 UTC | newest]

Thread overview: 6+ messages
     [not found] <20260325172348.1836430-1-tj@kernel.org>
2026-03-26  0:09 ` [PATCH v3 cgroup/for-7.0-fixes] cgroup: Fix cgroup_drain_dying() testing the wrong condition Tejun Heo
2026-03-25 17:23 Tejun Heo
2026-03-25 17:30 ` Tejun Heo
2026-03-25 18:06 ` Sebastian Andrzej Siewior
2026-03-26  0:02   ` Tejun Heo
2026-03-26  7:35     ` Sebastian Andrzej Siewior
