public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] cgroup: Wait for dying tasks to leave on rmdir
@ 2026-03-23 20:02 Tejun Heo
  2026-03-24  9:04 ` Sebastian Andrzej Siewior
  2026-03-24 20:24 ` Tejun Heo
  0 siblings, 2 replies; 5+ messages in thread
From: Tejun Heo @ 2026-03-23 20:02 UTC (permalink / raw)
  To: cgroups, linux-kernel
  Cc: Sebastian Andrzej Siewior, Bert Karwatzki, Michal Koutny,
	Johannes Weiner, kernel test robot, Tejun Heo

a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup") hid PF_EXITING
tasks from cgroup.procs so that systemd doesn't see tasks that have already
been reaped via waitpid(). However, the populated counter (nr_populated_csets)
is only decremented when the task later passes through cgroup_task_dead() in
finish_task_switch(). This means cgroup.procs can appear empty while the
cgroup is still populated, causing rmdir to fail with -EBUSY.

Fix this by making cgroup_rmdir() wait for dying tasks to fully leave. If the
cgroup is populated but all remaining tasks have PF_EXITING set (the task
iterator returns none due to the existing filter), wait for a kick from
cgroup_task_dead() and retry. The wait is brief as tasks are removed from the
cgroup's css_set between PF_EXITING assertion in do_exit() and
cgroup_task_dead() in finish_task_switch().

v2: cgroup_is_populated() true to false transition happens under css_set_lock
    not cgroup_mutex, so retest under css_set_lock before sleeping to avoid
    missed wakeups (Sebastian).

Fixes: a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202603222104.2c81684e-lkp@intel.com
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: cgroups@vger.kernel.org
---
 include/linux/cgroup-defs.h |  3 ++
 kernel/cgroup/cgroup.c      | 86 +++++++++++++++++++++++++++++++++++--
 2 files changed, 86 insertions(+), 3 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index bb92f5c169ca..7f87399938fa 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -609,6 +609,9 @@ struct cgroup {
 	/* used to wait for offlining of csses */
 	wait_queue_head_t offline_waitq;
 
+	/* used by cgroup_rmdir() to wait for dying tasks to leave */
+	wait_queue_head_t dying_populated_waitq;
+
 	/* used to schedule release agent */
 	struct work_struct release_agent_work;
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 01fc2a93f3ef..2163054e1aa6 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2126,6 +2126,7 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 #endif
 
 	init_waitqueue_head(&cgrp->offline_waitq);
+	init_waitqueue_head(&cgrp->dying_populated_waitq);
 	INIT_WORK(&cgrp->release_agent_work, cgroup1_release_agent);
 }
 
@@ -6224,6 +6225,76 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	return 0;
 };
 
+/**
+ * cgroup_drain_dying - wait for dying tasks to leave before rmdir
+ * @cgrp: the cgroup being removed
+ *
+ * The PF_EXITING filter in css_task_iter_advance() hides exiting tasks from
+ * cgroup.procs so that userspace (e.g. systemd) doesn't see tasks that have
+ * already been reaped via waitpid(). However, the populated counter
+ * (nr_populated_csets) is only decremented when the task later passes through
+ * cgroup_task_dead() in finish_task_switch(). This creates a window where
+ * cgroup.procs appears empty but cgroup_is_populated() is still true, causing
+ * rmdir to fail with -EBUSY.
+ *
+ * This function bridges that gap. If the cgroup is populated but all remaining
+ * tasks have PF_EXITING set, we wait for cgroup_task_dead() to process them.
+ * Tasks are removed from the cgroup's css_set in cgroup_task_dead() called from
+ * finish_task_switch(). As the window between PF_EXITING and cgroup_task_dead()
+ * is short, the number of PF_EXITING tasks on the list is small and the wait
+ * is brief.
+ *
+ * Each cgroup_task_dead() kicks the waitqueue via cset->cgrp_links, and we
+ * retry the full check from scratch.
+ *
+ * Must be called with cgroup_mutex held.
+ */
+static int cgroup_drain_dying(struct cgroup *cgrp)
+	__releases(&cgroup_mutex) __acquires(&cgroup_mutex)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+	DEFINE_WAIT(wait);
+
+	lockdep_assert_held(&cgroup_mutex);
+retry:
+	if (!cgroup_is_populated(cgrp))
+		return 0;
+
+	/* Same iterator as cgroup.threads - if any task is visible, it's busy */
+	css_task_iter_start(&cgrp->self, 0, &it);
+	task = css_task_iter_next(&it);
+	css_task_iter_end(&it);
+
+	if (task)
+		return -EBUSY;
+
+	/*
+	 * All remaining tasks are PF_EXITING and will pass through
+	 * cgroup_task_dead() shortly. Wait for a kick and retry.
+	 *
+	 * cgroup_is_populated() can't transition from false to true while
+	 * we're holding cgroup_mutex, but the true to false transition
+	 * happens under css_set_lock (via cgroup_task_dead()). We must
+	 * retest and prepare_to_wait() under css_set_lock. Otherwise, the
+	 * transition can happen between our first test and
+	 * prepare_to_wait(), and we sleep with no one to wake us.
+	 */
+	spin_lock_irq(&css_set_lock);
+	if (!cgroup_is_populated(cgrp)) {
+		spin_unlock_irq(&css_set_lock);
+		return 0;
+	}
+	prepare_to_wait(&cgrp->dying_populated_waitq, &wait,
+			TASK_UNINTERRUPTIBLE);
+	spin_unlock_irq(&css_set_lock);
+	mutex_unlock(&cgroup_mutex);
+	schedule();
+	finish_wait(&cgrp->dying_populated_waitq, &wait);
+	mutex_lock(&cgroup_mutex);
+	goto retry;
+}
+
 int cgroup_rmdir(struct kernfs_node *kn)
 {
 	struct cgroup *cgrp;
@@ -6233,9 +6304,12 @@ int cgroup_rmdir(struct kernfs_node *kn)
 	if (!cgrp)
 		return 0;
 
-	ret = cgroup_destroy_locked(cgrp);
-	if (!ret)
-		TRACE_CGROUP_PATH(rmdir, cgrp);
+	ret = cgroup_drain_dying(cgrp);
+	if (!ret) {
+		ret = cgroup_destroy_locked(cgrp);
+		if (!ret)
+			TRACE_CGROUP_PATH(rmdir, cgrp);
+	}
 
 	cgroup_kn_unlock(kn);
 	return ret;
@@ -6995,6 +7069,7 @@ void cgroup_task_exit(struct task_struct *tsk)
 
 static void do_cgroup_task_dead(struct task_struct *tsk)
 {
+	struct cgrp_cset_link *link;
 	struct css_set *cset;
 	unsigned long flags;
 
@@ -7008,6 +7083,11 @@ static void do_cgroup_task_dead(struct task_struct *tsk)
 	if (thread_group_leader(tsk) && atomic_read(&tsk->signal->live))
 		list_add_tail(&tsk->cg_list, &cset->dying_tasks);
 
+	/* kick cgroup_drain_dying() waiters, see cgroup_rmdir() */
+	list_for_each_entry(link, &cset->cgrp_links, cgrp_link)
+		if (waitqueue_active(&link->cgrp->dying_populated_waitq))
+			wake_up(&link->cgrp->dying_populated_waitq);
+
 	if (dl_task(tsk))
 		dec_dl_tasks_cs(tsk);
 
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] cgroup: Wait for dying tasks to leave on rmdir
  2026-03-23 20:02 [PATCH v2] cgroup: Wait for dying tasks to leave on rmdir Tejun Heo
@ 2026-03-24  9:04 ` Sebastian Andrzej Siewior
  2026-03-24 20:17   ` Tejun Heo
  2026-03-24 20:24 ` Tejun Heo
  1 sibling, 1 reply; 5+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-24  9:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, linux-kernel, Bert Karwatzki, Michal Koutny,
	Johannes Weiner, kernel test robot

On 2026-03-23 10:02:05 [-1000], Tejun Heo wrote:
> a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup") hid PF_EXITING
> tasks from cgroup.procs so that systemd doesn't see tasks that have already
> been reaped via waitpid(). However, the populated counter (nr_populated_csets)
> is only decremented when the task later passes through cgroup_task_dead() in
> finish_task_switch(). This means cgroup.procs can appear empty while the
> cgroup is still populated, causing rmdir to fail with -EBUSY.
> 
> Fix this by making cgroup_rmdir() wait for dying tasks to fully leave. If the
> cgroup is populated but all remaining tasks have PF_EXITING set (the task
> iterator returns none due to the existing filter), wait for a kick from
> cgroup_task_dead() and retry. The wait is brief as tasks are removed from the
> cgroup's css_set between PF_EXITING assertion in do_exit() and
> cgroup_task_dead() in finish_task_switch().
> 
> v2: cgroup_is_populated() true to false transition happens under css_set_lock
>     not cgroup_mutex, so retest under css_set_lock before sleeping to avoid
>     missed wakeups (Sebastian).
> 
> Fixes: a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup")
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202603222104.2c81684e-lkp@intel.com
> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Bert Karwatzki <spasswolf@web.de>
> Cc: Michal Koutny <mkoutny@suse.com>
> Cc: cgroups@vger.kernel.org

Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

As mentioned in the other email, if I change
-       irq_work_queue(this_cpu_ptr(&cgrp_dead_tasks_iwork));
+       schedule_delayed_work(this_cpu_ptr(&cgrp_delayed_tasks_iwork), 1 * HZ);

then boot hangs because systemd rmdir()s a cgroup that still has a task
in Z (zombie) state. That might suggest a race where systemd missed a
task. In any case, this fixes the other issue.

Sebastian

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] cgroup: Wait for dying tasks to leave on rmdir
  2026-03-24  9:04 ` Sebastian Andrzej Siewior
@ 2026-03-24 20:17   ` Tejun Heo
  2026-03-25 11:52     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 5+ messages in thread
From: Tejun Heo @ 2026-03-24 20:17 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: cgroups, linux-kernel, Bert Karwatzki, Michal Koutny,
	Johannes Weiner, kernel test robot

Hello,

On Tue, Mar 24, 2026 at 10:04:02AM +0100, Sebastian Andrzej Siewior wrote:
...
> As mentioned in the other email, if I change
> -       irq_work_queue(this_cpu_ptr(&cgrp_dead_tasks_iwork));
> +       schedule_delayed_work(this_cpu_ptr(&cgrp_delayed_tasks_iwork), 1 * HZ);
> 
> then boot hangs because systemd rmdir()s a cgroup that still has a task
> in Z (zombie) state. That might suggest a race where systemd missed a
> task. In any case, this fixes the other issue.

Just did 100 boot tests w/ the 1s delay added as above but the problem
didn't reproduce. Can't reproduce it with cgroup create / populate /
depopulate / rmdir stress tests either. I did see the 1s delay propagate
through, but that wasn't a deadlock. The code is not great. It'd be
better to just keep css_set_lock held while iterating too.

I'll apply this for now. Can you please try to reproduce the problem with
the patches applied? How reliably does it reproduce? How is it stuck? Are
tasks waiting on the waitq indefinitely with populated stuck at 1?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] cgroup: Wait for dying tasks to leave on rmdir
  2026-03-23 20:02 [PATCH v2] cgroup: Wait for dying tasks to leave on rmdir Tejun Heo
  2026-03-24  9:04 ` Sebastian Andrzej Siewior
@ 2026-03-24 20:24 ` Tejun Heo
  1 sibling, 0 replies; 5+ messages in thread
From: Tejun Heo @ 2026-03-24 20:24 UTC (permalink / raw)
  To: cgroups, linux-kernel
  Cc: Sebastian Andrzej Siewior, Bert Karwatzki, Michal Koutny,
	Johannes Weiner, kernel test robot

Applied to cgroup/for-7.0-fixes.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] cgroup: Wait for dying tasks to leave on rmdir
  2026-03-24 20:17   ` Tejun Heo
@ 2026-03-25 11:52     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 5+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-25 11:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, linux-kernel, Bert Karwatzki, Michal Koutny,
	Johannes Weiner, kernel test robot

On 2026-03-24 10:17:36 [-1000], Tejun Heo wrote:
> Hello,
Hi,

> I'll apply this for now. Can you please try to reproduce the problem with
> the patches applied? How reliably does it reproduce? How is it stuck? Are
> tasks waiting on the waitq indefinitely with populated stuck at 1?

With the patches applied the test suite works. But then I happened to do
a poweroff and it hung:

| [  OK  ] Finished systemd-poweroff.service - System Power Off.
| [  OK  ] Reached target poweroff.target - System Power Off.
| INFO: task systemd:1 blocked for more than 10 seconds.
|       Not tainted 7.0.0-rc5+ #167
| "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
| task:systemd         state:D stack:0     pid:1     tgid:1     ppid:0      task_flags:0x400100 flags:0x00080000
| Call Trace:
|  <TASK>
|  __schedule+0x3d4/0xf80
|  schedule+0x27/0xd0
|  cgroup_drain_dying+0xb7/0x1b0
|  cgroup_rmdir+0x2f/0x100
|  kernfs_iop_rmdir+0x6a/0xd0
|  vfs_rmdir+0x11a/0x280
|  filename_rmdir+0x16f/0x1e0
|  __x64_sys_unlinkat+0x58/0x70
|  do_syscall_64+0x119/0x5a0
|  entry_SYSCALL_64_after_hwframe+0x76/0x7e
| RIP: 0033:0x7f06b8b7fc17
| RSP: 002b:00007ffe37f96c98 EFLAGS: 00000206 ORIG_RAX: 0000000000000107
| RAX: ffffffffffffffda RBX: 000055ee562eba40 RCX: 00007f06b8b7fc17
| RDX: 0000000000000200 RSI: 000055ee562eba53 RDI: 000000000000000c
| RBP: 000000000000000d R08: 000055ee562eba40 R09: 0000000000000000
| R10: 00000000000007a0 R11: 0000000000000206 R12: 000055ee5639daa0
| R13: 0000000000000000 R14: 000055ee562eba53 R15: 000055ee562ec030
|  </TASK>

This is not PREEMPT_RT, and I can also reproduce it as of commit
6680c162b4850 ("selftests/cgroup: Don't require synchronous populated
update on task exit").
The reboot command is affected in the same way. systemd is 260-1, in
case it matters.

> Thanks.
> 
Sebastian

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-03-25 11:52 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-23 20:02 [PATCH v2] cgroup: Wait for dying tasks to leave on rmdir Tejun Heo
2026-03-24  9:04 ` Sebastian Andrzej Siewior
2026-03-24 20:17   ` Tejun Heo
2026-03-25 11:52     ` Sebastian Andrzej Siewior
2026-03-24 20:24 ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox