* [PATCHSET cgroup/for-7.2] cgroup: Per-css kill_css_finish deferral
@ 2026-05-05 0:51 Tejun Heo
2026-05-05 0:51 ` [PATCH 1/5] cgroup: Inline cgroup_has_tasks() in cgroup.h Tejun Heo
` (4 more replies)
0 siblings, 5 replies; 6+ messages in thread
From: Tejun Heo @ 2026-05-05 0:51 UTC (permalink / raw)
To: Johannes Weiner, Michal Koutný
Cc: Sebastian Andrzej Siewior, Petr Malat, Bert Karwatzki,
kernel test robot, Martin Pitt, cgroups, linux-kernel, Tejun Heo
Hello,
Follow-up to 93618edf7538 ("cgroup: Defer css percpu_ref kill on rmdir
until cgroup is depopulated") in cgroup/for-7.1-fixes, assumed merged
into cgroup/for-7.2.
That commit fixed the rmdir race by deferring kill_css_finish() at the
cgroup level so ->css_offline() runs only after PF_EXITING tasks have
left the cgroup. cgroup_apply_control_disable() has the same race shape
(PF_EXITING tasks pinning the dying controller's css while
->css_offline() runs), but fixing it requires switching
cgroup_lock_and_drain_offline()'s wait predicate from
percpu_ref_is_dying() to css_is_dying() to cover the deferral window -
too invasive for -stable, hence -7.2.
This series:
- Replaces the cgroup-level deferral with a per-subsys-css mechanism
so each controller css independently defers kill_css_finish() until
its own subtree drains.
- Pairs smp_mb()s in kill_css_sync() and css_update_populated() to
interlock the synchronous- and deferred-fire decisions.
- Wires cgroup_apply_control_disable() through the per-css deferral
and switches drain_offline to wait on css_is_dying.
After the predicate switch, a +ctrl re-enable issued while a deferred
-ctrl is still draining blocks in TASK_UNINTERRUPTIBLE on offline_waitq
until the dying css drains. This behavior pre-exists for rmdir; the
apply path now joins it.
Verified by 200001 iterations of repro-a72f73c4dd9b, per-commit
deterministic repros for the bug-chain commits, 5292 iterations of
stress-disable-control, and targeted ftrace coverage of rmdir,
apply_disable, and nested-destroy paths. No warnings or stalls.
Based on cgroup/for-7.2 (d8769544bde5) with cgroup/for-7.1-fixes
(93618edf7538) assumed merged.
Patches:
[PATCH 1/5] cgroup: Inline cgroup_has_tasks() in cgroup.h
[PATCH 2/5] cgroup: Annotate unlocked nr_populated_* accesses with READ_ONCE/WRITE_ONCE
[PATCH 3/5] cgroup: Move populated counters to cgroup_subsys_state
[PATCH 4/5] cgroup: Add per-subsys-css kill_css_finish deferral
[PATCH 5/5] cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()
Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git cgroup-drain-for-7.2
include/linux/cgroup-defs.h | 30 ++++---
include/linux/cgroup.h | 27 ++++++-
kernel/cgroup/cgroup.c | 188 +++++++++++++++++++++++++-------------------
kernel/cgroup/cpuset-v1.c | 2 +-
kernel/cgroup/cpuset.c | 2 +-
5 files changed, 148 insertions(+), 101 deletions(-)
Thanks.
--
tejun
* [PATCH 1/5] cgroup: Inline cgroup_has_tasks() in cgroup.h
2026-05-05 0:51 [PATCHSET cgroup/for-7.2] cgroup: Per-css kill_css_finish deferral Tejun Heo
@ 2026-05-05 0:51 ` Tejun Heo
2026-05-05 0:51 ` [PATCH 2/5] cgroup: Annotate unlocked nr_populated_* accesses with READ_ONCE/WRITE_ONCE Tejun Heo
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2026-05-05 0:51 UTC (permalink / raw)
To: Johannes Weiner, Michal Koutný
Cc: Sebastian Andrzej Siewior, Petr Malat, Bert Karwatzki,
kernel test robot, Martin Pitt, cgroups, linux-kernel, Tejun Heo
cpuset reads cs->css.cgroup->nr_populated_csets directly in two places to
test whether a cgroup has tasks. cgroup.c already has a matching helper,
cgroup_has_tasks(). Move it to cgroup.h as a static inline and use it
instead. This prepares for relocating cgroup->nr_populated_csets. No
semantic change.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/cgroup.h | 5 +++++
kernel/cgroup/cgroup.c | 5 -----
kernel/cgroup/cpuset-v1.c | 2 +-
kernel/cgroup/cpuset.c | 2 +-
4 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index e52160e85af4..ceb87507667e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -639,6 +639,11 @@ static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
return cgroup_is_descendant(cset->dfl_cgrp, ancestor);
}
+static inline bool cgroup_has_tasks(struct cgroup *cgrp)
+{
+ return cgrp->nr_populated_csets;
+}
+
/* no synchronization, the result can only be used as a hint */
static inline bool cgroup_is_populated(struct cgroup *cgrp)
{
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index bd10a7e2f9c5..7a94c2ea1036 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -376,11 +376,6 @@ static void cgroup_idr_remove(struct idr *idr, int id)
spin_unlock_bh(&cgroup_idr_lock);
}
-static bool cgroup_has_tasks(struct cgroup *cgrp)
-{
- return cgrp->nr_populated_csets;
-}
-
static bool cgroup_is_threaded(struct cgroup *cgrp)
{
return cgrp->dom_cgrp != cgrp;
diff --git a/kernel/cgroup/cpuset-v1.c b/kernel/cgroup/cpuset-v1.c
index 7308e9b02495..3e9968dd91e9 100644
--- a/kernel/cgroup/cpuset-v1.c
+++ b/kernel/cgroup/cpuset-v1.c
@@ -312,7 +312,7 @@ void cpuset1_hotplug_update_tasks(struct cpuset *cs,
* This is full cgroup operation which will also call back into
* cpuset. Execute it asynchronously using workqueue.
*/
- if (is_empty && cs->css.cgroup->nr_populated_csets &&
+ if (is_empty && cgroup_has_tasks(cs->css.cgroup) &&
css_tryget_online(&cs->css)) {
struct cpuset_remove_tasks_struct *s;
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e3a081a07c6d..a76006b62b9c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -432,7 +432,7 @@ static inline bool partition_is_populated(struct cpuset *cs,
* nr_populated_domain_children may include populated
* csets from descendants that are partitions.
*/
- if (cs->css.cgroup->nr_populated_csets ||
+ if (cgroup_has_tasks(cs->css.cgroup) ||
cs->attach_in_progress)
return true;
--
2.54.0
* [PATCH 2/5] cgroup: Annotate unlocked nr_populated_* accesses with READ_ONCE/WRITE_ONCE
2026-05-05 0:51 [PATCHSET cgroup/for-7.2] cgroup: Per-css kill_css_finish deferral Tejun Heo
2026-05-05 0:51 ` [PATCH 1/5] cgroup: Inline cgroup_has_tasks() in cgroup.h Tejun Heo
@ 2026-05-05 0:51 ` Tejun Heo
2026-05-05 0:51 ` [PATCH 3/5] cgroup: Move populated counters to cgroup_subsys_state Tejun Heo
` (2 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2026-05-05 0:51 UTC (permalink / raw)
To: Johannes Weiner, Michal Koutný
Cc: Sebastian Andrzej Siewior, Petr Malat, Bert Karwatzki,
kernel test robot, Martin Pitt, cgroups, linux-kernel, Tejun Heo
cgroup_update_populated() updates nr_populated_csets,
nr_populated_domain_children, and nr_populated_threaded_children under
css_set_lock, but cgroup_has_tasks(), cgroup_is_populated(), and
cgroup_can_be_thread_root() read them without holding it. Annotate the
accesses with READ_ONCE/WRITE_ONCE to document the lockless reads and
avoid load/store tearing.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/cgroup.h | 21 +++++++++++++++++----
kernel/cgroup/cgroup.c | 11 +++++++----
2 files changed, 24 insertions(+), 8 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ceb87507667e..9f8bef8f3a60 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -639,16 +639,29 @@ static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
return cgroup_is_descendant(cset->dfl_cgrp, ancestor);
}
+/*
+ * Populated counters: writes happen under css_set_lock. The accessors below
+ * may read unlocked. What an unpopulated result means depends on context:
+ *
+ * - No lock held. Just a snapshot. May race with concurrent updates and is
+ * useful only as a hint.
+ *
+ * - cgroup_mutex held. Migration into the cgroup is blocked, so an observed
+ * !populated stays !populated until cgroup_mutex is dropped.
+ *
+ * - CSS_DYING set. The css can no longer be repopulated, so !populated is
+ * sticky once observed.
+ */
static inline bool cgroup_has_tasks(struct cgroup *cgrp)
{
- return cgrp->nr_populated_csets;
+ return READ_ONCE(cgrp->nr_populated_csets);
}
-/* no synchronization, the result can only be used as a hint */
static inline bool cgroup_is_populated(struct cgroup *cgrp)
{
- return cgrp->nr_populated_csets + cgrp->nr_populated_domain_children +
- cgrp->nr_populated_threaded_children;
+ return READ_ONCE(cgrp->nr_populated_csets) +
+ READ_ONCE(cgrp->nr_populated_domain_children) +
+ READ_ONCE(cgrp->nr_populated_threaded_children);
}
/* returns ino associated with a cgroup */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 7a94c2ea1036..d1395784871a 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -404,7 +404,7 @@ static bool cgroup_can_be_thread_root(struct cgroup *cgrp)
return false;
/* can only have either domain or threaded children */
- if (cgrp->nr_populated_domain_children)
+ if (READ_ONCE(cgrp->nr_populated_domain_children))
return false;
/* and no domain controllers can be enabled */
@@ -783,12 +783,15 @@ static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
bool was_populated = cgroup_is_populated(cgrp);
if (!child) {
- cgrp->nr_populated_csets += adj;
+ WRITE_ONCE(cgrp->nr_populated_csets,
+ cgrp->nr_populated_csets + adj);
} else {
if (cgroup_is_threaded(child))
- cgrp->nr_populated_threaded_children += adj;
+ WRITE_ONCE(cgrp->nr_populated_threaded_children,
+ cgrp->nr_populated_threaded_children + adj);
else
- cgrp->nr_populated_domain_children += adj;
+ WRITE_ONCE(cgrp->nr_populated_domain_children,
+ cgrp->nr_populated_domain_children + adj);
}
if (was_populated == cgroup_is_populated(cgrp))
--
2.54.0
* [PATCH 3/5] cgroup: Move populated counters to cgroup_subsys_state
2026-05-05 0:51 [PATCHSET cgroup/for-7.2] cgroup: Per-css kill_css_finish deferral Tejun Heo
2026-05-05 0:51 ` [PATCH 1/5] cgroup: Inline cgroup_has_tasks() in cgroup.h Tejun Heo
2026-05-05 0:51 ` [PATCH 2/5] cgroup: Annotate unlocked nr_populated_* accesses with READ_ONCE/WRITE_ONCE Tejun Heo
@ 2026-05-05 0:51 ` Tejun Heo
2026-05-05 0:51 ` [PATCH 4/5] cgroup: Add per-subsys-css kill_css_finish deferral Tejun Heo
2026-05-05 0:51 ` [PATCH 5/5] cgroup: Defer kill_css_finish() in cgroup_apply_control_disable() Tejun Heo
4 siblings, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2026-05-05 0:51 UTC (permalink / raw)
To: Johannes Weiner, Michal Koutný
Cc: Sebastian Andrzej Siewior, Petr Malat, Bert Karwatzki,
kernel test robot, Martin Pitt, cgroups, linux-kernel, Tejun Heo
Later patches replace the cgroup-level finish_destroy_work deferral added
by 93618edf7538 ("cgroup: Defer css percpu_ref kill on rmdir until cgroup
is depopulated") with a per-subsys-css deferral. That needs each subsystem
css to track its own populated count. Move the populated counters from
cgroup onto cgroup_subsys_state. cgroup->self is itself a
cgroup_subsys_state and self.parent walks the same chain as cgroup_parent(),
so cgroup_update_populated() generalizes to a single css_update_populated()
taking a css. The cgroup-side bookkeeping runs only when the walk started
from a self css.
Keep nr_populated_{domain,threaded}_children on cgroup. Together they sum
to self.nr_populated_children, but they stay as dedicated fields so that
readers like cgroup_can_be_thread_root() retain unlocked access.
css_set_update_populated() also walks the per-subsys-css chain so each
subsystem css's hierarchical populated count is maintained. No reader
consumes those counts yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/cgroup-defs.h | 24 ++++++----
include/linux/cgroup.h | 11 +++--
kernel/cgroup/cgroup.c | 95 +++++++++++++++++++++----------------
3 files changed, 76 insertions(+), 54 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 50a784da7a81..c4929f7bbe5a 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -253,6 +253,15 @@ struct cgroup_subsys_state {
*/
int nr_descendants;
+ /*
+ * Hierarchical populated state. For cgroup->self, nr_populated_csets
+ * counts populated csets linked via cgrp_cset_link.
+ * nr_populated_children counts immediate-child csses whose own
+ * populated state is nonzero. Protected by css_set_lock.
+ */
+ int nr_populated_csets;
+ int nr_populated_children;
+
/*
* A singly-linked list of css structures to be rstat flushed.
* This is a scratch field to be used exclusively by
@@ -504,17 +513,12 @@ struct cgroup {
int max_descendants;
/*
- * Each non-empty css_set associated with this cgroup contributes
- * one to nr_populated_csets. The counter is zero iff this cgroup
- * doesn't have any tasks.
- *
- * All children which have non-zero nr_populated_csets and/or
- * nr_populated_children of their own contribute one to either
- * nr_populated_domain_children or nr_populated_threaded_children
- * depending on their type. Each counter is zero iff all cgroups
- * of the type in the subtree proper don't have any tasks.
+ * Domain/threaded split of self.nr_populated_children: each counts
+ * immediate-child cgroups whose subtree is populated and sums to
+ * self.nr_populated_children. Kept as separate fields to allow readers
+ * like cgroup_can_be_thread_root() unlocked access. Protected by
+ * css_set_lock; updated by css_update_populated().
*/
- int nr_populated_csets;
int nr_populated_domain_children;
int nr_populated_threaded_children;
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 9f8bef8f3a60..c2a8c38d8206 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -654,14 +654,17 @@ static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
*/
static inline bool cgroup_has_tasks(struct cgroup *cgrp)
{
- return READ_ONCE(cgrp->nr_populated_csets);
+ return READ_ONCE(cgrp->self.nr_populated_csets);
+}
+
+static inline bool css_is_populated(struct cgroup_subsys_state *css)
+{
+ return READ_ONCE(css->nr_populated_csets) || READ_ONCE(css->nr_populated_children);
}
static inline bool cgroup_is_populated(struct cgroup *cgrp)
{
- return READ_ONCE(cgrp->nr_populated_csets) +
- READ_ONCE(cgrp->nr_populated_domain_children) +
- READ_ONCE(cgrp->nr_populated_threaded_children);
+ return css_is_populated(&cgrp->self);
}
/* returns ino associated with a cgroup */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index d1395784871a..dd4ea9d83100 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -756,65 +756,70 @@ static bool css_set_populated(struct css_set *cset)
}
/**
- * cgroup_update_populated - update the populated count of a cgroup
- * @cgrp: the target cgroup
- * @populated: inc or dec populated count
- *
- * One of the css_sets associated with @cgrp is either getting its first
- * task or losing the last. Update @cgrp->nr_populated_* accordingly. The
- * count is propagated towards root so that a given cgroup's
- * nr_populated_children is zero iff none of its descendants contain any
- * tasks.
- *
- * @cgrp's interface file "cgroup.populated" is zero if both
- * @cgrp->nr_populated_csets and @cgrp->nr_populated_children are zero and
- * 1 otherwise. When the sum changes from or to zero, userland is notified
- * that the content of the interface file has changed. This can be used to
- * detect when @cgrp and its descendants become populated or empty.
+ * css_update_populated - update the populated state of a css and ancestors
+ * @css: leaf css whose own populated count is changing
+ * @populated: inc or dec
+ *
+ * One of the css_sets pinned by @css is getting its first task or losing the
+ * last. Propagate the transition up the parent chain so that a css's
+ * nr_populated_children is zero iff none of its descendants contain any tasks.
+ *
+ * For a cgroup->self walk, also runs cgroup-side bookkeeping at each level:
+ * domain/threaded child split, deferred-destroy trigger, and notification via
+ * "cgroup.populated" (zero iff cgrp->self has neither populated csets nor
+ * populated children; userland is notified on transitions).
*/
-static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
+static void css_update_populated(struct cgroup_subsys_state *css, bool populated)
{
- struct cgroup *child = NULL;
+ struct cgroup_subsys_state *child = NULL;
int adj = populated ? 1 : -1;
lockdep_assert_held(&css_set_lock);
do {
- bool was_populated = cgroup_is_populated(cgrp);
+ /* non-NULL only on the cgroup->self walk */
+ struct cgroup *cgrp = css_is_self(css) ? css->cgroup : NULL;
+ bool was_populated = css_is_populated(css);
if (!child) {
- WRITE_ONCE(cgrp->nr_populated_csets,
- cgrp->nr_populated_csets + adj);
+ WRITE_ONCE(css->nr_populated_csets,
+ css->nr_populated_csets + adj);
} else {
- if (cgroup_is_threaded(child))
- WRITE_ONCE(cgrp->nr_populated_threaded_children,
- cgrp->nr_populated_threaded_children + adj);
- else
- WRITE_ONCE(cgrp->nr_populated_domain_children,
- cgrp->nr_populated_domain_children + adj);
+ WRITE_ONCE(css->nr_populated_children,
+ css->nr_populated_children + adj);
+ if (cgrp) {
+ if (cgroup_is_threaded(child->cgroup))
+ WRITE_ONCE(cgrp->nr_populated_threaded_children,
+ cgrp->nr_populated_threaded_children + adj);
+ else
+ WRITE_ONCE(cgrp->nr_populated_domain_children,
+ cgrp->nr_populated_domain_children + adj);
+ }
}
- if (was_populated == cgroup_is_populated(cgrp))
+ if (was_populated == css_is_populated(css))
break;
/*
* Subtree just emptied below an offlined cgrp. Fire deferred
* destroy. The transition is one-shot.
*/
- if (was_populated && !css_is_online(&cgrp->self)) {
+ if (cgrp && was_populated && !css_is_online(css)) {
cgroup_get(cgrp);
WARN_ON_ONCE(!queue_work(cgroup_offline_wq,
&cgrp->finish_destroy_work));
}
- cgroup1_check_for_release(cgrp);
- TRACE_CGROUP_PATH(notify_populated, cgrp,
- cgroup_is_populated(cgrp));
- cgroup_file_notify(&cgrp->events_file);
+ if (cgrp) {
+ cgroup1_check_for_release(cgrp);
+ TRACE_CGROUP_PATH(notify_populated, cgrp,
+ cgroup_is_populated(cgrp));
+ cgroup_file_notify(&cgrp->events_file);
+ }
- child = cgrp;
- cgrp = cgroup_parent(cgrp);
- } while (cgrp);
+ child = css;
+ css = css->parent;
+ } while (css);
}
/**
@@ -822,17 +827,27 @@ static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
* @cset: target css_set
* @populated: whether @cset is populated or depopulated
*
- * @cset is either getting the first task or losing the last. Update the
- * populated counters of all associated cgroups accordingly.
+ * @cset is either getting the first task or losing the last. Update the
+ * populated counters along each linked cgroup's self chain and each
+ * subsystem css that @cset pins.
*/
static void css_set_update_populated(struct css_set *cset, bool populated)
{
struct cgrp_cset_link *link;
+ struct cgroup_subsys *ss;
+ int ssid;
lockdep_assert_held(&css_set_lock);
list_for_each_entry(link, &cset->cgrp_links, cgrp_link)
- cgroup_update_populated(link->cgrp, populated);
+ css_update_populated(&link->cgrp->self, populated);
+
+ for_each_subsys(ss, ssid) {
+ struct cgroup_subsys_state *css = cset->subsys[ssid];
+
+ if (css)
+ css_update_populated(css, populated);
+ }
}
/*
@@ -2190,7 +2205,7 @@ int cgroup_setup_root(struct cgroup_root *root, u32 ss_mask)
hash_for_each(css_set_table, i, cset, hlist) {
link_css_set(&tmp_links, cset, root_cgrp);
if (css_set_populated(cset))
- cgroup_update_populated(root_cgrp, true);
+ css_update_populated(&root_cgrp->self, true);
}
spin_unlock_irq(&css_set_lock);
@@ -6145,7 +6160,7 @@ static void kill_css_finish(struct cgroup_subsys_state *css)
*
* - cgroup_finish_destroy(): kicks the percpu_ref kill via kill_css_finish() on
* each subsystem css. Fires once @cgrp's subtree is fully drained, either
- * inline here or from cgroup_update_populated().
+ * inline here or from css_update_populated().
*
* - The percpu_ref kill chain: css_killed_ref_fn -> css_killed_work_fn ->
* ->css_offline() -> release/free.
--
2.54.0
* [PATCH 4/5] cgroup: Add per-subsys-css kill_css_finish deferral
2026-05-05 0:51 [PATCHSET cgroup/for-7.2] cgroup: Per-css kill_css_finish deferral Tejun Heo
` (2 preceding siblings ...)
2026-05-05 0:51 ` [PATCH 3/5] cgroup: Move populated counters to cgroup_subsys_state Tejun Heo
@ 2026-05-05 0:51 ` Tejun Heo
2026-05-05 0:51 ` [PATCH 5/5] cgroup: Defer kill_css_finish() in cgroup_apply_control_disable() Tejun Heo
4 siblings, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2026-05-05 0:51 UTC (permalink / raw)
To: Johannes Weiner, Michal Koutný
Cc: Sebastian Andrzej Siewior, Petr Malat, Bert Karwatzki,
kernel test robot, Martin Pitt, cgroups, linux-kernel, Tejun Heo
93618edf7538 ("cgroup: Defer css percpu_ref kill on rmdir until cgroup is
depopulated") deferred kill_css_finish() at the cgroup level: rmdir waits
for the entire cgroup's populated count to drop to zero, then fires
kill_css_finish() on every subsystem css at once. Replace that with
per-subsys-css deferral. Each subsystem css now tracks its own hierarchical
populated count and independently defers its kill_css_finish() until its own
subtree drains.
The rmdir-race fix carries through with the same shape: the dying css's
->css_offline() still waits until no PF_EXITING task references it, while
v2's cgroup-level deferral machinery goes away.
cgroup_apply_control_disable() has the same race shape (PF_EXITING tasks
pinning a css whose ->css_offline() is about to run) and stays synchronous
here. This patch lays the groundwork for fixing it - per-cgroup waiting
can't gate one subsys css being killed while the rest of the cgroup stays
live, but per-css can.
The subtree-wide invariant is preserved: a dying ancestor css stays
populated through nr_populated_children until every dying descendant
drains its tasks, so the walker fires the ancestor's kill_finish_work
only after all descendants have drained.
Add paired smp_mb()s in kill_css_sync() and css_update_populated() to fence
the StoreLoad on (CSS_DYING, populated counter), guaranteeing that either
the walker queues kill_finish_work or the caller fires synchronously.
cgroup_destroy_locked() was implicitly fenced by an unrelated css_set_lock
pair; cgroup_apply_control_disable() in the next patch is not.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/cgroup-defs.h | 6 +--
kernel/cgroup/cgroup.c | 83 +++++++++++++++++++------------------
2 files changed, 46 insertions(+), 43 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index c4929f7bbe5a..de2cd6238c2a 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -262,6 +262,9 @@ struct cgroup_subsys_state {
int nr_populated_csets;
int nr_populated_children;
+ /* deferred kill_css_finish() queued by css_update_populated() */
+ struct work_struct kill_finish_work;
+
/*
* A singly-linked list of css structures to be rstat flushed.
* This is a scratch field to be used exclusively by
@@ -615,9 +618,6 @@ struct cgroup {
/* used to wait for offlining of csses */
wait_queue_head_t offline_waitq;
- /* defers killing csses after removal until cgroup is depopulated */
- struct work_struct finish_destroy_work;
-
/* used to schedule release agent */
struct work_struct release_agent_work;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index dd4ea9d83100..fa24102535d9 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -264,7 +264,6 @@ static void cgroup_finalize_control(struct cgroup *cgrp, int ret);
static void css_task_iter_skip(struct css_task_iter *it,
struct task_struct *task);
static int cgroup_destroy_locked(struct cgroup *cgrp);
-static void cgroup_finish_destroy(struct cgroup *cgrp);
static void kill_css_sync(struct cgroup_subsys_state *css);
static void kill_css_finish(struct cgroup_subsys_state *css);
static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
@@ -801,13 +800,19 @@ static void css_update_populated(struct cgroup_subsys_state *css, bool populated
break;
/*
- * Subtree just emptied below an offlined cgrp. Fire deferred
- * destroy. The transition is one-shot.
+ * Pair with smp_mb() in kill_css_sync(). Either we observe
+ * CSS_DYING and queue, or the caller observes our decrement
+ * and fires synchronously.
*/
- if (cgrp && was_populated && !css_is_online(css)) {
- cgroup_get(cgrp);
- WARN_ON_ONCE(!queue_work(cgroup_offline_wq,
- &cgrp->finish_destroy_work));
+ smp_mb();
+
+ /*
+ * Subtree just emptied below a dying css. Fire deferred kill.
+ * The transition is one-shot for a dying css.
+ */
+ if (was_populated && css_is_dying(css)) {
+ css_get(css);
+ WARN_ON_ONCE(!queue_work(cgroup_offline_wq, &css->kill_finish_work));
}
if (cgrp) {
@@ -2064,16 +2069,6 @@ static int cgroup_reconfigure(struct fs_context *fc)
return 0;
}
-static void cgroup_finish_destroy_work_fn(struct work_struct *work)
-{
- struct cgroup *cgrp = container_of(work, struct cgroup, finish_destroy_work);
-
- cgroup_lock();
- cgroup_finish_destroy(cgrp);
- cgroup_unlock();
- cgroup_put(cgrp);
-}
-
static void init_cgroup_housekeeping(struct cgroup *cgrp)
{
struct cgroup_subsys *ss;
@@ -2100,7 +2095,6 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
#endif
init_waitqueue_head(&cgrp->offline_waitq);
- INIT_WORK(&cgrp->finish_destroy_work, cgroup_finish_destroy_work_fn);
INIT_WORK(&cgrp->release_agent_work, cgroup1_release_agent);
}
@@ -5695,6 +5689,22 @@ static void css_release(struct percpu_ref *ref)
queue_work(cgroup_release_wq, &css->destroy_work);
}
+/*
+ * Deferred kill_css_finish() fired from css_update_populated() once a dying
+ * css's hierarchical populated state drops to zero. Pinned by css_get() at the
+ * queue site; matched by css_put() here.
+ */
+static void kill_css_finish_work_fn(struct work_struct *work)
+{
+ struct cgroup_subsys_state *css =
+ container_of(work, struct cgroup_subsys_state, kill_finish_work);
+
+ cgroup_lock();
+ kill_css_finish(css);
+ cgroup_unlock();
+ css_put(css);
+}
+
static void init_and_link_css(struct cgroup_subsys_state *css,
struct cgroup_subsys *ss, struct cgroup *cgrp)
{
@@ -5708,6 +5718,7 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
css->id = -1;
INIT_LIST_HEAD(&css->sibling);
INIT_LIST_HEAD(&css->children);
+ INIT_WORK(&css->kill_finish_work, kill_css_finish_work_fn);
css->serial_nr = css_serial_nr_next++;
atomic_set(&css->online_cnt, 0);
@@ -6083,6 +6094,13 @@ static void kill_css_sync(struct cgroup_subsys_state *css)
css->flags |= CSS_DYING;
+ /*
+ * Pair with smp_mb() in css_update_populated(). Either our
+ * caller observes the walker's decrement and fires
+ * synchronously, or the walker observes CSS_DYING and queues.
+ */
+ smp_mb();
+
/*
* This must happen before css is disassociated with its cgroup.
* See seq_css() for details.
@@ -6158,9 +6176,9 @@ static void kill_css_finish(struct cgroup_subsys_state *css)
* - This function: synchronous user-visible state teardown plus kill_css_sync()
* on each subsystem css.
*
- * - cgroup_finish_destroy(): kicks the percpu_ref kill via kill_css_finish() on
- * each subsystem css. Fires once @cgrp's subtree is fully drained, either
- * inline here or from css_update_populated().
+ * - For each subsys css: fire kill_css_finish() synchronously if the subtree is
+ * already drained, otherwise rely on css_update_populated() to queue
+ * kill_finish_work when the last populated cset under the css empties.
*
* - The percpu_ref kill chain: css_killed_ref_fn -> css_killed_work_fn ->
* ->css_offline() -> release/free.
@@ -6238,29 +6256,14 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
/* put the base reference */
percpu_ref_kill(&cgrp->self.refcnt);
- if (!cgroup_is_populated(cgrp))
- cgroup_finish_destroy(cgrp);
+ for_each_css(css, ssid, cgrp) {
+ if (!css_is_populated(css))
+ kill_css_finish(css);
+ }
return 0;
};
-/**
- * cgroup_finish_destroy - deferred half of @cgrp destruction
- * @cgrp: cgroup whose subtree just became empty
- *
- * See cgroup_destroy_locked() for the rationale.
- */
-static void cgroup_finish_destroy(struct cgroup *cgrp)
-{
- struct cgroup_subsys_state *css;
- int ssid;
-
- lockdep_assert_held(&cgroup_mutex);
-
- for_each_css(css, ssid, cgrp)
- kill_css_finish(css);
-}
-
int cgroup_rmdir(struct kernfs_node *kn)
{
struct cgroup *cgrp;
--
2.54.0
* [PATCH 5/5] cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()
2026-05-05 0:51 [PATCHSET cgroup/for-7.2] cgroup: Per-css kill_css_finish deferral Tejun Heo
` (3 preceding siblings ...)
2026-05-05 0:51 ` [PATCH 4/5] cgroup: Add per-subsys-css kill_css_finish deferral Tejun Heo
@ 2026-05-05 0:51 ` Tejun Heo
4 siblings, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2026-05-05 0:51 UTC (permalink / raw)
To: Johannes Weiner, Michal Koutný
Cc: Sebastian Andrzej Siewior, Petr Malat, Bert Karwatzki,
kernel test robot, Martin Pitt, cgroups, linux-kernel, Tejun Heo
Same race shape as the rmdir path that 93618edf7538 ("cgroup: Defer css
percpu_ref kill on rmdir until cgroup is depopulated") fixed: a task past
exit_signals() whose cset subsys[ssid] still pins the disabled controller's
css can be touching subsys state while ->css_offline() runs. The earlier
patches in this series built up the per-subsys-css deferral machinery and
routed cgroup_destroy_locked() through it. Apply the same shape to
cgroup_apply_control_disable():
kill_css_sync(css);
if (!css_is_populated(css))
kill_css_finish(css);
When the dying css is still populated, kill_css_finish() is deferred. The
walker in css_update_populated() fires kill_finish_work once the css's
hierarchical populated count drops to zero.
cgroup_lock_and_drain_offline()'s wait predicate switches from
percpu_ref_is_dying() to css_is_dying(). CSS_DYING is set by
kill_css_sync() before the percpu_ref is killed, so css_is_dying() covers
a strict superset of percpu_ref_is_dying(). Without this change, a +cpu
re-enable after a deferred -cpu disable would skip the drain (the
percpu_ref isn't killed yet) and observe the still-CSS_DYING css through
cgroup_css(), treating it as live.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/cgroup/cgroup.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index fa24102535d9..bdc8deedb4f7 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3237,7 +3237,7 @@ void cgroup_lock_and_drain_offline(struct cgroup *cgrp)
struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
DEFINE_WAIT(wait);
- if (!css || !percpu_ref_is_dying(&css->refcnt))
+ if (!css || !css_is_dying(css))
continue;
cgroup_get_live(dsct);
@@ -3405,7 +3405,8 @@ static void cgroup_apply_control_disable(struct cgroup *cgrp)
if (css->parent &&
!(cgroup_ss_mask(dsct) & (1 << ss->id))) {
kill_css_sync(css);
- kill_css_finish(css);
+ if (!css_is_populated(css))
+ kill_css_finish(css);
} else if (!css_visible(css)) {
css_clear_dir(css);
if (ss->css_reset)
--
2.54.0