* [PATCH] cgroup: Wait for dying tasks to leave on rmdir
@ 2026-03-23 3:58 Tejun Heo
2026-03-23 11:32 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 4+ messages in thread
From: Tejun Heo @ 2026-03-23 3:58 UTC (permalink / raw)
To: cgroups, linux-kernel
Cc: Sebastian Andrzej Siewior, Bert Karwatzki, Michal Koutny,
kernel test robot, Tejun Heo
a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup") hid PF_EXITING
tasks from cgroup.procs so that systemd doesn't see tasks that have already
been reaped via waitpid(). However, the populated counter (nr_populated_csets)
is only decremented when the task later passes through cgroup_task_dead() in
finish_task_switch(). This means cgroup.procs can appear empty while the
cgroup is still populated, causing rmdir to fail with -EBUSY.
Fix this by making cgroup_rmdir() wait for dying tasks to fully leave. If the
cgroup is populated but all remaining tasks have PF_EXITING set (the task
iterator returns none due to the existing filter), wait for a kick from
cgroup_task_dead() and retry. The wait is brief as tasks are removed from the
cgroup's css_set between PF_EXITING assertion in do_exit() and
cgroup_task_dead() in finish_task_switch().
Fixes: a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202603222104.2c81684e-lkp@intel.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: cgroups@vger.kernel.org
---
include/linux/cgroup-defs.h | 3 ++
kernel/cgroup/cgroup.c | 73 +++++++++++++++++++++++++++++++++++--
2 files changed, 73 insertions(+), 3 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index bb92f5c169ca..7f87399938fa 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -609,6 +609,9 @@ struct cgroup {
/* used to wait for offlining of csses */
wait_queue_head_t offline_waitq;
+ /* used by cgroup_rmdir() to wait for dying tasks to leave */
+ wait_queue_head_t dying_populated_waitq;
+
/* used to schedule release agent */
struct work_struct release_agent_work;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 01fc2a93f3ef..49c5622a1a63 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2126,6 +2126,7 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
#endif
init_waitqueue_head(&cgrp->offline_waitq);
+ init_waitqueue_head(&cgrp->dying_populated_waitq);
INIT_WORK(&cgrp->release_agent_work, cgroup1_release_agent);
}
@@ -6224,6 +6225,63 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
return 0;
};
+/**
+ * cgroup_drain_dying - wait for dying tasks to leave before rmdir
+ * @cgrp: the cgroup being removed
+ *
+ * The PF_EXITING filter in css_task_iter_advance() hides exiting tasks from
+ * cgroup.procs so that userspace (e.g. systemd) doesn't see tasks that have
+ * already been reaped via waitpid(). However, the populated counter
+ * (nr_populated_csets) is only decremented when the task later passes through
+ * cgroup_task_dead() in finish_task_switch(). This creates a window where
+ * cgroup.procs appears empty but cgroup_is_populated() is still true, causing
+ * rmdir to fail with -EBUSY.
+ *
+ * This function bridges that gap. If the cgroup is populated but all remaining
+ * tasks have PF_EXITING set, we wait for cgroup_task_dead() to process them.
+ * Tasks are removed from the cgroup's css_set in cgroup_task_dead() called from
+ * finish_task_switch(). As the window between PF_EXITING and cgroup_task_dead()
+ * is short, the number of PF_EXITING tasks on the list is small and the wait
+ * is brief.
+ *
+ * Each cgroup_task_dead() kicks the waitqueue via cset->cgrp_links, and we
+ * retry the full check from scratch.
+ *
+ * Must be called with cgroup_mutex held.
+ */
+static int cgroup_drain_dying(struct cgroup *cgrp)
+ __releases(&cgroup_mutex) __acquires(&cgroup_mutex)
+{
+ struct css_task_iter it;
+ struct task_struct *task;
+ DEFINE_WAIT(wait);
+
+ lockdep_assert_held(&cgroup_mutex);
+retry:
+ if (!cgroup_is_populated(cgrp))
+ return 0;
+
+ /* Same iterator as cgroup.threads - if any task is visible, it's busy */
+ css_task_iter_start(&cgrp->self, 0, &it);
+ task = css_task_iter_next(&it);
+ css_task_iter_end(&it);
+
+ if (task)
+ return -EBUSY;
+
+ /*
+ * All remaining tasks are PF_EXITING and will pass through
+ * cgroup_task_dead() shortly. Wait for a kick and retry.
+ */
+ prepare_to_wait(&cgrp->dying_populated_waitq, &wait,
+ TASK_UNINTERRUPTIBLE);
+ mutex_unlock(&cgroup_mutex);
+ schedule();
+ finish_wait(&cgrp->dying_populated_waitq, &wait);
+ mutex_lock(&cgroup_mutex);
+ goto retry;
+}
+
int cgroup_rmdir(struct kernfs_node *kn)
{
struct cgroup *cgrp;
@@ -6233,9 +6291,12 @@ int cgroup_rmdir(struct kernfs_node *kn)
if (!cgrp)
return 0;
- ret = cgroup_destroy_locked(cgrp);
- if (!ret)
- TRACE_CGROUP_PATH(rmdir, cgrp);
+ ret = cgroup_drain_dying(cgrp);
+ if (!ret) {
+ ret = cgroup_destroy_locked(cgrp);
+ if (!ret)
+ TRACE_CGROUP_PATH(rmdir, cgrp);
+ }
cgroup_kn_unlock(kn);
return ret;
@@ -6995,6 +7056,7 @@ void cgroup_task_exit(struct task_struct *tsk)
static void do_cgroup_task_dead(struct task_struct *tsk)
{
+ struct cgrp_cset_link *link;
struct css_set *cset;
unsigned long flags;
@@ -7008,6 +7070,11 @@ static void do_cgroup_task_dead(struct task_struct *tsk)
if (thread_group_leader(tsk) && atomic_read(&tsk->signal->live))
list_add_tail(&tsk->cg_list, &cset->dying_tasks);
+ /* kick cgroup_drain_dying() waiters, see cgroup_rmdir() */
+ list_for_each_entry(link, &cset->cgrp_links, cgrp_link)
+ if (waitqueue_active(&link->cgrp->dying_populated_waitq))
+ wake_up(&link->cgrp->dying_populated_waitq);
+
if (dl_task(tsk))
dec_dl_tasks_cs(tsk);
--
2.53.0
* Re: [PATCH] cgroup: Wait for dying tasks to leave on rmdir
2026-03-23 3:58 [PATCH] cgroup: Wait for dying tasks to leave on rmdir Tejun Heo
@ 2026-03-23 11:32 ` Sebastian Andrzej Siewior
2026-03-23 19:55 ` Tejun Heo
0 siblings, 1 reply; 4+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-23 11:32 UTC (permalink / raw)
To: Tejun Heo
Cc: cgroups, linux-kernel, Bert Karwatzki, Michal Koutny,
kernel test robot
On 2026-03-22 17:58:06 [-1000], Tejun Heo wrote:
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -6224,6 +6225,63 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
…
> +static int cgroup_drain_dying(struct cgroup *cgrp)
> + __releases(&cgroup_mutex) __acquires(&cgroup_mutex)
> +{
> + struct css_task_iter it;
> + struct task_struct *task;
> + DEFINE_WAIT(wait);
> +
> + lockdep_assert_held(&cgroup_mutex);
> +retry:
> + if (!cgroup_is_populated(cgrp))
> + return 0;
> +
> + /* Same iterator as cgroup.threads - if any task is visible, it's busy */
> + css_task_iter_start(&cgrp->self, 0, &it);
> + task = css_task_iter_next(&it);
> + css_task_iter_end(&it);
> +
> + if (task)
> + return -EBUSY;
> +
> + /*
> + * All remaining tasks are PF_EXITING and will pass through
> + * cgroup_task_dead() shortly. Wait for a kick and retry.
> + */
> + prepare_to_wait(&cgrp->dying_populated_waitq, &wait,
> + TASK_UNINTERRUPTIBLE);
> + mutex_unlock(&cgroup_mutex);
I had to add here
if (cgroup_is_populated(cgrp))
> + schedule();
I saw instances on PREEMPT_RT where the above cgroup_is_populated()
reported true (cgrp->nr_populated_csets == 1) and the following
iterator returned NULL, but in that window do_cgroup_task_dead() saw no
waiter and continued without a wake_up, so the following schedule()
hung.
There is no serialisation between this check/wait and the later wake. An
alternative would be to do the check and prepare_to_wait() under css_set_lock.
> + finish_wait(&cgrp->dying_populated_waitq, &wait);
> + mutex_lock(&cgroup_mutex);
> + goto retry;
> +}
Then I added my RCU patch. This led to a problem already during boot up
(didn't manage to get to the test suite).
systemd-1 places modprobe-1044 in a cgroup, then destroys the cgroup.
It hangs in cgroup_drain_dying() because nr_populated_csets is still 1.
modprobe-1044 is still there in Z so the cgroup removal didn't get there
yet. That irq_work was quicker than RCU in this case. This can be
reproduced without RCU by
- irq_work_queue(this_cpu_ptr(&cgrp_dead_tasks_iwork));
+ schedule_delayed_work(this_cpu_ptr(&cgrp_delayed_tasks_iwork), HZ);
So there is always a one-second delay. If I give up waiting after 10 secs
then it boots eventually and there are no zombies around. The test_core
suite seems to complete…
Having the irq_work as-is, then the "cgroup_dead()" happens on the HZ
tick. test_core then complains just with
| not ok 7 test_cgcore_populated
everything else passes. With schedule_work() (as in right away) all
tests pass including test_stress.sh
Is there another race lurking?
Sebastian
* Re: [PATCH] cgroup: Wait for dying tasks to leave on rmdir
2026-03-23 11:32 ` Sebastian Andrzej Siewior
@ 2026-03-23 19:55 ` Tejun Heo
2026-03-24 8:21 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 4+ messages in thread
From: Tejun Heo @ 2026-03-23 19:55 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: cgroups, linux-kernel, Bert Karwatzki, Michal Koutny,
kernel test robot
Hello,
On Mon, Mar 23, 2026 at 12:32:52PM +0100, Sebastian Andrzej Siewior wrote:
...
> I saw instances on PREEMPT_RT where the above cgroup_is_populated()
> reported true due to cgrp->nr_populated_csets = 1, the following
> iterator returned NULL but in that time do_cgroup_task_dead() saw no
> waiter and continued without a wake_up and then the following schedule()
> hung.
Ah, right, false->true is protected by cgroup_mutex but true->false is only
by css_set_lock. It should check populated again with css_set_lock held and
then do prepare_to_wait().
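Something along these lines might close it (an untested sketch; it assumes
do_cgroup_task_dead() issues the wake_up while still holding css_set_lock,
so that the lock serialises the true->false transition against the sleep):

```c
	bool populated;

	prepare_to_wait(&cgrp->dying_populated_waitq, &wait,
			TASK_UNINTERRUPTIBLE);
	/*
	 * Re-check under css_set_lock after going on the waitqueue: a
	 * true->false transition either happens before this check (we
	 * skip the sleep) or after it (the waker sees us queued and
	 * wakes us).
	 */
	spin_lock_irq(&css_set_lock);
	populated = cgroup_is_populated(cgrp);
	spin_unlock_irq(&css_set_lock);
	mutex_unlock(&cgroup_mutex);
	if (populated)
		schedule();
	finish_wait(&cgrp->dying_populated_waitq, &wait);
	mutex_lock(&cgroup_mutex);
	goto retry;
```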
> There is no serialisation between this wait/ check and latter wake. An
> alternative would be to check and prepare_to_wait() under css_set_lock.
Yeap.
> > + finish_wait(&cgrp->dying_populated_waitq, &wait);
> > + mutex_lock(&cgroup_mutex);
> > + goto retry;
> > +}
>
> Then I added my RCU patch. This led to a problem already during boot up
> (didn't manage to get to the test suite).
Is that the patch to move cgroup_task_dead() to delayed_put_task_struct()? I
don't think we can delay populated state update till usage count reaches
zero. e.g. bpf_task_acquire() can be used by arbitrary bpf programs and will
pin the usage count indefinitely delaying populated state update. Similar to
delaying the event to free path, you can construct a deadlock scenario too.
> systemd-1 places modprobe-1044 in a cgroup, then destroys the cgroup.
> It hangs in cgroup_drain_dying() because nr_populated_csets is still 1.
> modprobe-1044 is still there in Z so the cgroup removal didn't get there
> yet. That irq_work was quicker than RCU in this case. This can be
> reproduced without RCU by
Isn't this the exact scenario? systemd is the one who should reap and drop
the usage count, but it's waiting for rmdir() to finish, which can't finish
due to the usage count which hasn't been reaped by systemd? We can't
interlock these two. They have to make progress independently.
> - irq_work_queue(this_cpu_ptr(&cgrp_dead_tasks_iwork));
> + schedule_delayed_work(this_cpu_ptr(&cgrp_delayed_tasks_iwork), HZ);
>
> So there is always a one second delay. If I give up waiting after 10secs
> then it boots eventually and there are no zombies around. The test_core
> seems to complete…
>
> Having the irq_work as-is, then the "cgroup_dead()" happens on the HZ
> tick. test_core then complains just with
> | not ok 7 test_cgcore_populated
The test assumes that waitpid() success guarantees the cgroup !populated
event. While that held before all these changes, it wasn't intentional and
the test just picked up on an arbitrary ordering. I'll just remove that
particular test.
Thanks.
--
tejun
* Re: [PATCH] cgroup: Wait for dying tasks to leave on rmdir
2026-03-23 19:55 ` Tejun Heo
@ 2026-03-24 8:21 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 4+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-24 8:21 UTC (permalink / raw)
To: Tejun Heo
Cc: cgroups, linux-kernel, Bert Karwatzki, Michal Koutny,
kernel test robot
On 2026-03-23 09:55:40 [-1000], Tejun Heo wrote:
> Hello,
Hi,
> > Then I added my RCU patch. This led to a problem already during boot up
> > (didn't manage to get to the test suite).
>
> Is that the patch to move cgroup_task_dead() to delayed_put_task_struct()? I
> don't think we can delay populated state update till usage count reaches
> zero. e.g. bpf_task_acquire() can be used by arbitrary bpf programs and will
> pin the usage count indefinitely delaying populated state update. Similar to
> delaying the event to free path, you can construct a deadlock scenario too.
Okay, then. I expected it to be a limited window within a bpf program or
sched_ext.
> > systemd-1 places modprobe-1044 in a cgroup, then destroys the cgroup.
> > It hangs in cgroup_drain_dying() because nr_populated_csets is still 1.
> > modprobe-1044 is still there in Z so the cgroup removal didn't get there
> > yet. That irq_work was quicker than RCU in this case. This can be
> > reproduced without RCU by
>
> Isn't this the exact scenario? systemd is the one who should reap and drop
> the usage count but it's waiting for rmdir() to finish which can't finish
> due to the usage count which hasn't been reapted by systemd? We can't
> interlock these two. They have to make progress independently.
But nobody is holding it back. For some reason systemd-1 did not reap
modprobe-1044 first but went for the rmdir() first. I noticed it with
RCU first but it was also there after delaying the cleanup by one second
without RCU.
> > - irq_work_queue(this_cpu_ptr(&cgrp_dead_tasks_iwork));
> > + schedule_delayed_work(this_cpu_ptr(&cgrp_delayed_tasks_iwork), HZ);
> >
> > So there is always a one second delay. If I give up waiting after 10secs
> > then it boots eventually and there are no zombies around. The test_core
> > seems to complete…
> >
> > Having the irq_work as-is, then the "cgroup_dead()" happens on the HZ
> > tick. test_core then complains just with
> > | not ok 7 test_cgcore_populated
>
> The test is assuming that waitpid() success guarantees cgroup !populated
> event. While before all these changes, that held, it wasn't intentional and
> the test just picked up on arbitrary ordering. I'll just remove that
> particular test.
okay. Thanks.
> Thanks.
Sebastian