From: Tejun Heo <tj@kernel.org>
To: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Bert Karwatzki <spasswolf@web.de>,
Michal Koutny <mkoutny@suse.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
kernel test robot <oliver.sang@intel.com>
Subject: [PATCH v3 cgroup/for-7.0-fixes] cgroup: Fix cgroup_drain_dying() testing the wrong condition
Date: Wed, 25 Mar 2026 07:23:48 -1000 [thread overview]
Message-ID: <68d8881fd985a410c0f619f009334c28@kernel.org> (raw)
cgroup_drain_dying() was using cgroup_is_populated() to test whether there are
dying tasks to wait for. cgroup_is_populated() tests nr_populated_csets,
nr_populated_domain_children and nr_populated_threaded_children, but
cgroup_drain_dying() only needs to care about this cgroup's own tasks - whether
there are children is cgroup_destroy_locked()'s concern.
This caused hangs during shutdown. When systemd tried to rmdir a cgroup that had
no direct tasks but had a populated child, cgroup_drain_dying() would enter its
wait loop because cgroup_is_populated() was true from
nr_populated_domain_children. The task iterator found nothing to wait for, yet
the populated state never cleared because it was driven by live tasks in the
child cgroup.
Fix it by using cgroup_has_tasks() which only tests nr_populated_csets.
v3: Fix cgroup_is_populated() -> cgroup_has_tasks() (Sebastian).
v2: https://lore.kernel.org/r/20260323200205.1063629-1-tj@kernel.org
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Fixes: 1b164b876c36 ("cgroup: Wait for dying tasks to leave on rmdir")
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/cgroup/cgroup.c | 42 ++++++++++++++++++++++--------------------
1 file changed, 22 insertions(+), 20 deletions(-)
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6229,20 +6229,22 @@ static int cgroup_destroy_locked(struct
* cgroup_drain_dying - wait for dying tasks to leave before rmdir
* @cgrp: the cgroup being removed
*
- * The PF_EXITING filter in css_task_iter_advance() hides exiting tasks from
- * cgroup.procs so that userspace (e.g. systemd) doesn't see tasks that have
- * already been reaped via waitpid(). However, the populated counter
- * (nr_populated_csets) is only decremented when the task later passes through
+ * cgroup.procs and cgroup.threads use css_task_iter which filters out
+ * PF_EXITING tasks so that userspace doesn't see tasks that have already been
+ * reaped via waitpid(). However, cgroup_has_tasks() - which tests whether the
+ * cgroup has non-empty css_sets - is only updated when dying tasks pass through
* cgroup_task_dead() in finish_task_switch(). This creates a window where
- * cgroup.procs appears empty but cgroup_is_populated() is still true, causing
- * rmdir to fail with -EBUSY.
+ * cgroup.procs reads empty but cgroup_has_tasks() is still true, making rmdir
+ * fail with -EBUSY from cgroup_destroy_locked() even though userspace sees no
+ * tasks.
*
- * This function bridges that gap. If the cgroup is populated but all remaining
- * tasks have PF_EXITING set, we wait for cgroup_task_dead() to process them.
- * Tasks are removed from the cgroup's css_set in cgroup_task_dead() called from
- * finish_task_switch(). As the window between PF_EXITING and cgroup_task_dead()
- * is short, the number of PF_EXITING tasks on the list is small and the wait
- * is brief.
+ * This function aligns cgroup_has_tasks() with what userspace can observe. If
+ * cgroup_has_tasks() but the task iterator sees nothing (all remaining tasks are
+ * PF_EXITING), we wait for cgroup_task_dead() to finish processing them. As the
+ * window between PF_EXITING and cgroup_task_dead() is short, the wait is brief.
+ *
+ * This function only concerns itself with this cgroup's own dying tasks.
+ * Whether the cgroup has children is cgroup_destroy_locked()'s problem.
*
* Each cgroup_task_dead() kicks the waitqueue via cset->cgrp_links, and we
* retry the full check from scratch.
@@ -6258,7 +6260,7 @@ static int cgroup_drain_dying(struct cgr
lockdep_assert_held(&cgroup_mutex);
retry:
- if (!cgroup_is_populated(cgrp))
+ if (!cgroup_has_tasks(cgrp))
return 0;
/* Same iterator as cgroup.threads - if any task is visible, it's busy */
@@ -6273,15 +6275,15 @@ retry:
* All remaining tasks are PF_EXITING and will pass through
* cgroup_task_dead() shortly. Wait for a kick and retry.
*
- * cgroup_is_populated() can't transition from false to true while
- * we're holding cgroup_mutex, but the true to false transition
- * happens under css_set_lock (via cgroup_task_dead()). We must
- * retest and prepare_to_wait() under css_set_lock. Otherwise, the
- * transition can happen between our first test and
- * prepare_to_wait(), and we sleep with no one to wake us.
+ * cgroup_has_tasks() can't transition from false to true while we're
+ * holding cgroup_mutex, but the true to false transition happens
+ * under css_set_lock (via cgroup_task_dead()). We must retest and
+ * prepare_to_wait() under css_set_lock. Otherwise, the transition
+ * can happen between our first test and prepare_to_wait(), and we
+ * sleep with no one to wake us.
*/
spin_lock_irq(&css_set_lock);
- if (!cgroup_is_populated(cgrp)) {
+ if (!cgroup_has_tasks(cgrp)) {
spin_unlock_irq(&css_set_lock);
return 0;
}
next reply other threads:[~2026-03-25 17:23 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-25 17:23 Tejun Heo [this message]
2026-03-25 17:30 ` [PATCH v3 cgroup/for-7.0-fixes] cgroup: Fix cgroup_drain_dying() testing the wrong condition Tejun Heo
2026-03-25 18:06 ` Sebastian Andrzej Siewior
2026-03-26 0:02 ` Tejun Heo
2026-03-26 7:35 ` Sebastian Andrzej Siewior
2026-03-27 20:18 ` Tejun Heo
2026-04-01 13:27 ` Sebastian Andrzej Siewior
[not found] <20260325172348.1836430-1-tj@kernel.org>
2026-03-26 0:09 ` Tejun Heo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=68d8881fd985a410c0f619f009334c28@kernel.org \
--to=tj@kernel.org \
--cc=bigeasy@linutronix.de \
--cc=cgroups@vger.kernel.org \
--cc=hannes@cmpxchg.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mkoutny@suse.com \
--cc=oliver.sang@intel.com \
--cc=spasswolf@web.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.