* [PATCHSET cgroup/for-7.2] cgroup: Per-css kill_css_finish deferral
@ 2026-05-05  0:51 Tejun Heo
  2026-05-05  0:51 ` [PATCH 1/5] cgroup: Inline cgroup_has_tasks() in cgroup.h Tejun Heo
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Tejun Heo @ 2026-05-05  0:51 UTC
  To: Johannes Weiner, Michal Koutný
  Cc: Sebastian Andrzej Siewior, Petr Malat, Bert Karwatzki,
	kernel test robot, Martin Pitt, cgroups, linux-kernel, Tejun Heo

Hello,

Follow-up to 93618edf7538 ("cgroup: Defer css percpu_ref kill on rmdir
until cgroup is depopulated") in cgroup/for-7.1-fixes, assumed merged
into cgroup/for-7.2.

That commit fixed the rmdir race by deferring kill_css_finish() at the
cgroup level so ->css_offline() runs only after PF_EXITING tasks have
left the cgroup. cgroup_apply_control_disable() has the same race shape
(PF_EXITING tasks pinning the dying controller's css while
->css_offline() runs), but fixing it requires switching
cgroup_lock_and_drain_offline()'s wait predicate from
percpu_ref_is_dying() to css_is_dying() to cover the deferral window -
too invasive for -stable, hence -7.2.

This series:

  - Replaces the cgroup-level deferral with a per-subsys-css mechanism
    so each controller css independently defers kill_css_finish() until
    its own subtree drains.

  - Pairs smp_mb()s in kill_css_sync() and css_update_populated() to
    interlock the synchronous- and deferred-fire decisions.

  - Wires cgroup_apply_control_disable() through the per-css deferral
    and switches drain_offline to wait on css_is_dying.

After the predicate switch, a +ctrl re-enable issued while a deferred
-ctrl is still draining blocks in TASK_UNINTERRUPTIBLE on offline_waitq
until the dying css drains. Pre-existing for rmdir; the apply path now
joins it.

Verified by 200001 iterations of repro-a72f73c4dd9b, per-commit
deterministic repros for the bug-chain commits, 5292 iterations of
stress-disable-control, and targeted ftrace coverage of rmdir,
apply_disable, and nested-destroy paths. No warnings or stalls.

Based on cgroup/for-7.2 (d8769544bde5) with cgroup/for-7.1-fixes
(93618edf7538) assumed merged.

Patches:

  [PATCH 1/5] cgroup: Inline cgroup_has_tasks() in cgroup.h
  [PATCH 2/5] cgroup: Annotate unlocked nr_populated_* accesses with READ_ONCE/WRITE_ONCE
  [PATCH 3/5] cgroup: Move populated counters to cgroup_subsys_state
  [PATCH 4/5] cgroup: Add per-subsys-css kill_css_finish deferral
  [PATCH 5/5] cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()

Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git cgroup-drain-for-7.2

 include/linux/cgroup-defs.h |  30 ++++---
 include/linux/cgroup.h      |  27 ++++++-
 kernel/cgroup/cgroup.c      | 188 +++++++++++++++++++++++++-------------------
 kernel/cgroup/cpuset-v1.c   |   2 +-
 kernel/cgroup/cpuset.c      |   2 +-
 5 files changed, 148 insertions(+), 101 deletions(-)

Thanks.

--
tejun


* [PATCH 1/5] cgroup: Inline cgroup_has_tasks() in cgroup.h
  2026-05-05  0:51 [PATCHSET cgroup/for-7.2] cgroup: Per-css kill_css_finish deferral Tejun Heo
@ 2026-05-05  0:51 ` Tejun Heo
  2026-05-05  0:51 ` [PATCH 2/5] cgroup: Annotate unlocked nr_populated_* accesses with READ_ONCE/WRITE_ONCE Tejun Heo
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2026-05-05  0:51 UTC
  To: Johannes Weiner, Michal Koutný
  Cc: Sebastian Andrzej Siewior, Petr Malat, Bert Karwatzki,
	kernel test robot, Martin Pitt, cgroups, linux-kernel, Tejun Heo

cpuset reads cs->css.cgroup->nr_populated_csets directly in two places to
test whether a cgroup has tasks. cgroup.c already has a matching helper,
cgroup_has_tasks(). Move it to cgroup.h as static inline and use that
instead. This is to prepare for relocation of cgroup->nr_populated_csets. No
semantic change.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup.h    | 5 +++++
 kernel/cgroup/cgroup.c    | 5 -----
 kernel/cgroup/cpuset-v1.c | 2 +-
 kernel/cgroup/cpuset.c    | 2 +-
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index e52160e85af4..ceb87507667e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -639,6 +639,11 @@ static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
 	return cgroup_is_descendant(cset->dfl_cgrp, ancestor);
 }
 
+static inline bool cgroup_has_tasks(struct cgroup *cgrp)
+{
+	return cgrp->nr_populated_csets;
+}
+
 /* no synchronization, the result can only be used as a hint */
 static inline bool cgroup_is_populated(struct cgroup *cgrp)
 {
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index bd10a7e2f9c5..7a94c2ea1036 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -376,11 +376,6 @@ static void cgroup_idr_remove(struct idr *idr, int id)
 	spin_unlock_bh(&cgroup_idr_lock);
 }
 
-static bool cgroup_has_tasks(struct cgroup *cgrp)
-{
-	return cgrp->nr_populated_csets;
-}
-
 static bool cgroup_is_threaded(struct cgroup *cgrp)
 {
 	return cgrp->dom_cgrp != cgrp;
diff --git a/kernel/cgroup/cpuset-v1.c b/kernel/cgroup/cpuset-v1.c
index 7308e9b02495..3e9968dd91e9 100644
--- a/kernel/cgroup/cpuset-v1.c
+++ b/kernel/cgroup/cpuset-v1.c
@@ -312,7 +312,7 @@ void cpuset1_hotplug_update_tasks(struct cpuset *cs,
 	 * This is full cgroup operation which will also call back into
 	 * cpuset. Execute it asynchronously using workqueue.
 	 */
-	if (is_empty && cs->css.cgroup->nr_populated_csets &&
+	if (is_empty && cgroup_has_tasks(cs->css.cgroup) &&
 	    css_tryget_online(&cs->css)) {
 		struct cpuset_remove_tasks_struct *s;
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e3a081a07c6d..a76006b62b9c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -432,7 +432,7 @@ static inline bool partition_is_populated(struct cpuset *cs,
 	 * nr_populated_domain_children may include populated
 	 * csets from descendants that are partitions.
 	 */
-	if (cs->css.cgroup->nr_populated_csets ||
+	if (cgroup_has_tasks(cs->css.cgroup) ||
 	    cs->attach_in_progress)
 		return true;
 
-- 
2.54.0



* [PATCH 2/5] cgroup: Annotate unlocked nr_populated_* accesses with READ_ONCE/WRITE_ONCE
  2026-05-05  0:51 [PATCHSET cgroup/for-7.2] cgroup: Per-css kill_css_finish deferral Tejun Heo
  2026-05-05  0:51 ` [PATCH 1/5] cgroup: Inline cgroup_has_tasks() in cgroup.h Tejun Heo
@ 2026-05-05  0:51 ` Tejun Heo
  2026-05-05  0:51 ` [PATCH 3/5] cgroup: Move populated counters to cgroup_subsys_state Tejun Heo
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2026-05-05  0:51 UTC
  To: Johannes Weiner, Michal Koutný
  Cc: Sebastian Andrzej Siewior, Petr Malat, Bert Karwatzki,
	kernel test robot, Martin Pitt, cgroups, linux-kernel, Tejun Heo

cgroup_update_populated() updates nr_populated_csets,
nr_populated_domain_children, and nr_populated_threaded_children under
css_set_lock, but cgroup_has_tasks(), cgroup_is_populated(), and
cgroup_can_be_thread_root() read them without holding it. Use
READ_ONCE/WRITE_ONCE.
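
The annotation pattern can be sketched in userspace C. This is only an
illustration: the macros below are simplified stand-ins for the kernel's
READ_ONCE/WRITE_ONCE (the real ones do more checking), and struct toy_cgroup,
toy_update_populated() and toy_has_tasks() are hypothetical names that do not
appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified userspace stand-ins for the kernel macros (assumption: the
 * kernel's own definitions are more elaborate). __typeof__ is a GCC/Clang
 * extension.
 */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical toy struct; only the counter name mirrors the patch. */
struct toy_cgroup {
	int nr_populated_csets;
};

/* Writer side: in the kernel this runs under css_set_lock. */
static void toy_update_populated(struct toy_cgroup *cgrp, bool populated)
{
	int adj = populated ? 1 : -1;

	/* The lock holder may read plainly; only the store is marked. */
	WRITE_ONCE(cgrp->nr_populated_csets, cgrp->nr_populated_csets + adj);
	assert(cgrp->nr_populated_csets >= 0);
}

/* Reader side: may run without the lock, so the load is marked. */
static bool toy_has_tasks(struct toy_cgroup *cgrp)
{
	return READ_ONCE(cgrp->nr_populated_csets) != 0;
}
```

The marked accesses only keep the compiler from tearing or re-reading the
value; what an unlocked result means still depends on the caller's context.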

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup.h | 21 +++++++++++++++++----
 kernel/cgroup/cgroup.c | 11 +++++++----
 2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ceb87507667e..9f8bef8f3a60 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -639,16 +639,29 @@ static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
 	return cgroup_is_descendant(cset->dfl_cgrp, ancestor);
 }
 
+/*
+ * Populated counters: writes happen under css_set_lock. The accessors below
+ * may read unlocked. What an unpopulated result means depends on context:
+ *
+ * - No lock held. Just a snapshot. May race with concurrent updates and is
+ *   useful only as a hint.
+ *
+ * - cgroup_mutex held. Migration into the cgroup is blocked, so an observed
+ *   !populated stays !populated until cgroup_mutex is dropped.
+ *
+ * - CSS_DYING set. The css can no longer be repopulated, so !populated is
+ *   sticky once observed.
+ */
 static inline bool cgroup_has_tasks(struct cgroup *cgrp)
 {
-	return cgrp->nr_populated_csets;
+	return READ_ONCE(cgrp->nr_populated_csets);
 }
 
-/* no synchronization, the result can only be used as a hint */
 static inline bool cgroup_is_populated(struct cgroup *cgrp)
 {
-	return cgrp->nr_populated_csets + cgrp->nr_populated_domain_children +
-		cgrp->nr_populated_threaded_children;
+	return READ_ONCE(cgrp->nr_populated_csets) +
+		READ_ONCE(cgrp->nr_populated_domain_children) +
+		READ_ONCE(cgrp->nr_populated_threaded_children);
 }
 
 /* returns ino associated with a cgroup */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 7a94c2ea1036..d1395784871a 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -404,7 +404,7 @@ static bool cgroup_can_be_thread_root(struct cgroup *cgrp)
 		return false;
 
 	/* can only have either domain or threaded children */
-	if (cgrp->nr_populated_domain_children)
+	if (READ_ONCE(cgrp->nr_populated_domain_children))
 		return false;
 
 	/* and no domain controllers can be enabled */
@@ -783,12 +783,15 @@ static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
 		bool was_populated = cgroup_is_populated(cgrp);
 
 		if (!child) {
-			cgrp->nr_populated_csets += adj;
+			WRITE_ONCE(cgrp->nr_populated_csets,
+				   cgrp->nr_populated_csets + adj);
 		} else {
 			if (cgroup_is_threaded(child))
-				cgrp->nr_populated_threaded_children += adj;
+				WRITE_ONCE(cgrp->nr_populated_threaded_children,
+					   cgrp->nr_populated_threaded_children + adj);
 			else
-				cgrp->nr_populated_domain_children += adj;
+				WRITE_ONCE(cgrp->nr_populated_domain_children,
+					   cgrp->nr_populated_domain_children + adj);
 		}
 
 		if (was_populated == cgroup_is_populated(cgrp))
-- 
2.54.0



* [PATCH 3/5] cgroup: Move populated counters to cgroup_subsys_state
  2026-05-05  0:51 [PATCHSET cgroup/for-7.2] cgroup: Per-css kill_css_finish deferral Tejun Heo
  2026-05-05  0:51 ` [PATCH 1/5] cgroup: Inline cgroup_has_tasks() in cgroup.h Tejun Heo
  2026-05-05  0:51 ` [PATCH 2/5] cgroup: Annotate unlocked nr_populated_* accesses with READ_ONCE/WRITE_ONCE Tejun Heo
@ 2026-05-05  0:51 ` Tejun Heo
  2026-05-05  0:51 ` [PATCH 4/5] cgroup: Add per-subsys-css kill_css_finish deferral Tejun Heo
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2026-05-05  0:51 UTC
  To: Johannes Weiner, Michal Koutný
  Cc: Sebastian Andrzej Siewior, Petr Malat, Bert Karwatzki,
	kernel test robot, Martin Pitt, cgroups, linux-kernel, Tejun Heo

Later patches replace the cgroup-level finish_destroy_work deferral added
by 93618edf7538 ("cgroup: Defer css percpu_ref kill on rmdir until cgroup
is depopulated") with a per-subsys-css deferral. That needs each subsystem
css to track its own populated count. Move the populated counters from
cgroup onto cgroup_subsys_state. cgroup->self is itself a
cgroup_subsys_state and self.parent walks the same chain as cgroup_parent(),
so cgroup_update_populated() generalizes to a single css_update_populated()
taking a css. The cgroup-side bookkeeping runs only when the walk started
from a self css.

Keep nr_populated_{domain,threaded}_children on cgroup. Together they sum to
self.nr_populated_children, but they stay as dedicated fields so that readers
like cgroup_can_be_thread_root() can access them unlocked.

css_set_update_populated() also walks the per-subsys-css chain so each
subsystem css's hierarchical populated count is maintained. No reader
consumes those counts yet.
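
A minimal userspace model of the hierarchical walk, for illustration only.
struct toy_css and the toy_* functions are hypothetical; the model leaves out
locking, the cgroup-side domain/threaded split, and notifications, and keeps
only the counter propagation described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy css; field names follow the patch, the rest is assumed. */
struct toy_css {
	struct toy_css *parent;
	int nr_populated_csets;		/* populated csets attached here */
	int nr_populated_children;	/* immediate children that are populated */
};

static bool toy_css_is_populated(const struct toy_css *css)
{
	return css->nr_populated_csets || css->nr_populated_children;
}

/*
 * Adjust the leaf's cset count, then walk towards the root. Stop as soon
 * as a level's populated state does not flip, since the ancestors above it
 * already account for that level.
 */
static void toy_update_populated(struct toy_css *css, bool populated)
{
	struct toy_css *child = NULL;
	int adj = populated ? 1 : -1;

	do {
		bool was_populated = toy_css_is_populated(css);

		if (!child)
			css->nr_populated_csets += adj;
		else
			css->nr_populated_children += adj;

		assert(css->nr_populated_csets >= 0);
		assert(css->nr_populated_children >= 0);

		if (was_populated == toy_css_is_populated(css))
			break;

		child = css;
		css = css->parent;
	} while (css);
}
```

In this model a parent's nr_populated_children counts each populated child
once, no matter how many populated csets that child holds.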

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup-defs.h | 24 ++++++----
 include/linux/cgroup.h      | 11 +++--
 kernel/cgroup/cgroup.c      | 95 +++++++++++++++++++++----------------
 3 files changed, 76 insertions(+), 54 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 50a784da7a81..c4929f7bbe5a 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -253,6 +253,15 @@ struct cgroup_subsys_state {
 	 */
 	int nr_descendants;
 
+	/*
+	 * Hierarchical populated state. For cgroup->self, nr_populated_csets
+	 * counts populated csets linked via cgrp_cset_link.
+	 * nr_populated_children counts immediate-child csses whose own
+	 * populated state is nonzero. Protected by css_set_lock.
+	 */
+	int nr_populated_csets;
+	int nr_populated_children;
+
 	/*
 	 * A singly-linked list of css structures to be rstat flushed.
 	 * This is a scratch field to be used exclusively by
@@ -504,17 +513,12 @@ struct cgroup {
 	int max_descendants;
 
 	/*
-	 * Each non-empty css_set associated with this cgroup contributes
-	 * one to nr_populated_csets.  The counter is zero iff this cgroup
-	 * doesn't have any tasks.
-	 *
-	 * All children which have non-zero nr_populated_csets and/or
-	 * nr_populated_children of their own contribute one to either
-	 * nr_populated_domain_children or nr_populated_threaded_children
-	 * depending on their type.  Each counter is zero iff all cgroups
-	 * of the type in the subtree proper don't have any tasks.
+	 * Domain/threaded split of self.nr_populated_children: each counts
+	 * immediate-child cgroups whose subtree is populated and sums to
+	 * self.nr_populated_children. Kept as separate fields to allow readers
+	 * like cgroup_can_be_thread_root() unlocked access. Protected by
+	 * css_set_lock; updated by css_update_populated().
 	 */
-	int nr_populated_csets;
 	int nr_populated_domain_children;
 	int nr_populated_threaded_children;
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 9f8bef8f3a60..c2a8c38d8206 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -654,14 +654,17 @@ static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
  */
 static inline bool cgroup_has_tasks(struct cgroup *cgrp)
 {
-	return READ_ONCE(cgrp->nr_populated_csets);
+	return READ_ONCE(cgrp->self.nr_populated_csets);
+}
+
+static inline bool css_is_populated(struct cgroup_subsys_state *css)
+{
+	return READ_ONCE(css->nr_populated_csets) || READ_ONCE(css->nr_populated_children);
 }
 
 static inline bool cgroup_is_populated(struct cgroup *cgrp)
 {
-	return READ_ONCE(cgrp->nr_populated_csets) +
-		READ_ONCE(cgrp->nr_populated_domain_children) +
-		READ_ONCE(cgrp->nr_populated_threaded_children);
+	return css_is_populated(&cgrp->self);
 }
 
 /* returns ino associated with a cgroup */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index d1395784871a..dd4ea9d83100 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -756,65 +756,70 @@ static bool css_set_populated(struct css_set *cset)
 }
 
 /**
- * cgroup_update_populated - update the populated count of a cgroup
- * @cgrp: the target cgroup
- * @populated: inc or dec populated count
- *
- * One of the css_sets associated with @cgrp is either getting its first
- * task or losing the last.  Update @cgrp->nr_populated_* accordingly.  The
- * count is propagated towards root so that a given cgroup's
- * nr_populated_children is zero iff none of its descendants contain any
- * tasks.
- *
- * @cgrp's interface file "cgroup.populated" is zero if both
- * @cgrp->nr_populated_csets and @cgrp->nr_populated_children are zero and
- * 1 otherwise.  When the sum changes from or to zero, userland is notified
- * that the content of the interface file has changed.  This can be used to
- * detect when @cgrp and its descendants become populated or empty.
+ * css_update_populated - update the populated state of a css and ancestors
+ * @css: leaf css whose own populated count is changing
+ * @populated: inc or dec
+ *
+ * One of the css_sets pinned by @css is getting its first task or losing the
+ * last. Propagate the transition up the parent chain so that a css's
+ * nr_populated_children is zero iff none of its descendants contain any tasks.
+ *
+ * For a cgroup->self walk, also runs cgroup-side bookkeeping at each level:
+ * domain/threaded child split, deferred-destroy trigger, and notification via
+ * "cgroup.populated" (zero iff cgrp->self has neither populated csets nor
+ * populated children; userland is notified on transitions).
  */
-static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
+static void css_update_populated(struct cgroup_subsys_state *css, bool populated)
 {
-	struct cgroup *child = NULL;
+	struct cgroup_subsys_state *child = NULL;
 	int adj = populated ? 1 : -1;
 
 	lockdep_assert_held(&css_set_lock);
 
 	do {
-		bool was_populated = cgroup_is_populated(cgrp);
+		/* non-NULL only on the cgroup->self walk */
+		struct cgroup *cgrp = css_is_self(css) ? css->cgroup : NULL;
+		bool was_populated = css_is_populated(css);
 
 		if (!child) {
-			WRITE_ONCE(cgrp->nr_populated_csets,
-				   cgrp->nr_populated_csets + adj);
+			WRITE_ONCE(css->nr_populated_csets,
+				   css->nr_populated_csets + adj);
 		} else {
-			if (cgroup_is_threaded(child))
-				WRITE_ONCE(cgrp->nr_populated_threaded_children,
-					   cgrp->nr_populated_threaded_children + adj);
-			else
-				WRITE_ONCE(cgrp->nr_populated_domain_children,
-					   cgrp->nr_populated_domain_children + adj);
+			WRITE_ONCE(css->nr_populated_children,
+				   css->nr_populated_children + adj);
+			if (cgrp) {
+				if (cgroup_is_threaded(child->cgroup))
+					WRITE_ONCE(cgrp->nr_populated_threaded_children,
+						   cgrp->nr_populated_threaded_children + adj);
+				else
+					WRITE_ONCE(cgrp->nr_populated_domain_children,
+						   cgrp->nr_populated_domain_children + adj);
+			}
 		}
 
-		if (was_populated == cgroup_is_populated(cgrp))
+		if (was_populated == css_is_populated(css))
 			break;
 
 		/*
 		 * Subtree just emptied below an offlined cgrp. Fire deferred
 		 * destroy. The transition is one-shot.
 		 */
-		if (was_populated && !css_is_online(&cgrp->self)) {
+		if (cgrp && was_populated && !css_is_online(css)) {
 			cgroup_get(cgrp);
 			WARN_ON_ONCE(!queue_work(cgroup_offline_wq,
 						 &cgrp->finish_destroy_work));
 		}
 
-		cgroup1_check_for_release(cgrp);
-		TRACE_CGROUP_PATH(notify_populated, cgrp,
-				  cgroup_is_populated(cgrp));
-		cgroup_file_notify(&cgrp->events_file);
+		if (cgrp) {
+			cgroup1_check_for_release(cgrp);
+			TRACE_CGROUP_PATH(notify_populated, cgrp,
+					  cgroup_is_populated(cgrp));
+			cgroup_file_notify(&cgrp->events_file);
+		}
 
-		child = cgrp;
-		cgrp = cgroup_parent(cgrp);
-	} while (cgrp);
+		child = css;
+		css = css->parent;
+	} while (css);
 }
 
 /**
@@ -822,17 +827,27 @@ static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
  * @cset: target css_set
  * @populated: whether @cset is populated or depopulated
  *
- * @cset is either getting the first task or losing the last.  Update the
- * populated counters of all associated cgroups accordingly.
+ * @cset is either getting the first task or losing the last. Update the
+ * populated counters along each linked cgroup's self chain and each
+ * subsystem css that @cset pins.
  */
 static void css_set_update_populated(struct css_set *cset, bool populated)
 {
 	struct cgrp_cset_link *link;
+	struct cgroup_subsys *ss;
+	int ssid;
 
 	lockdep_assert_held(&css_set_lock);
 
 	list_for_each_entry(link, &cset->cgrp_links, cgrp_link)
-		cgroup_update_populated(link->cgrp, populated);
+		css_update_populated(&link->cgrp->self, populated);
+
+	for_each_subsys(ss, ssid) {
+		struct cgroup_subsys_state *css = cset->subsys[ssid];
+
+		if (css)
+			css_update_populated(css, populated);
+	}
 }
 
 /*
@@ -2190,7 +2205,7 @@ int cgroup_setup_root(struct cgroup_root *root, u32 ss_mask)
 	hash_for_each(css_set_table, i, cset, hlist) {
 		link_css_set(&tmp_links, cset, root_cgrp);
 		if (css_set_populated(cset))
-			cgroup_update_populated(root_cgrp, true);
+			css_update_populated(&root_cgrp->self, true);
 	}
 	spin_unlock_irq(&css_set_lock);
 
@@ -6145,7 +6160,7 @@ static void kill_css_finish(struct cgroup_subsys_state *css)
  *
  * - cgroup_finish_destroy(): kicks the percpu_ref kill via kill_css_finish() on
  *   each subsystem css. Fires once @cgrp's subtree is fully drained, either
- *   inline here or from cgroup_update_populated().
+ *   inline here or from css_update_populated().
  *
  * - The percpu_ref kill chain: css_killed_ref_fn -> css_killed_work_fn ->
  *   ->css_offline() -> release/free.
-- 
2.54.0



* [PATCH 4/5] cgroup: Add per-subsys-css kill_css_finish deferral
  2026-05-05  0:51 [PATCHSET cgroup/for-7.2] cgroup: Per-css kill_css_finish deferral Tejun Heo
                   ` (2 preceding siblings ...)
  2026-05-05  0:51 ` [PATCH 3/5] cgroup: Move populated counters to cgroup_subsys_state Tejun Heo
@ 2026-05-05  0:51 ` Tejun Heo
  2026-05-05  0:51 ` [PATCH 5/5] cgroup: Defer kill_css_finish() in cgroup_apply_control_disable() Tejun Heo
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2026-05-05  0:51 UTC
  To: Johannes Weiner, Michal Koutný
  Cc: Sebastian Andrzej Siewior, Petr Malat, Bert Karwatzki,
	kernel test robot, Martin Pitt, cgroups, linux-kernel, Tejun Heo

93618edf7538 ("cgroup: Defer css percpu_ref kill on rmdir until cgroup is
depopulated") deferred kill_css_finish() at the cgroup level: rmdir waits
for the entire cgroup's populated count to drop to zero, then fires
kill_css_finish() on every subsystem css at once. Replace that with
per-subsys-css deferral. Each subsystem css now tracks its own hierarchical
populated count and independently defers its kill_css_finish() until its own
subtree drains.

The rmdir-race fix carries through unchanged in shape. The dying css's
->css_offline() still waits until no PF_EXITING task references it, and v2's
cgroup-level machinery goes away.

cgroup_apply_control_disable() has the same race shape (PF_EXITING tasks
pinning a css whose ->css_offline() is about to run) and stays synchronous
here. This patch lays the groundwork for fixing it - per-cgroup waiting
can't gate one subsys css being killed while the rest of the cgroup stays
live, but per-css can.

Subtree-wide invariant preserved: a dying ancestor css stays populated
through nr_populated_children until every dying descendant's task drains, so
the walker fires the ancestor's kill_finish_work only after all descendants
have drained.

Add paired smp_mb()s in kill_css_sync() and css_update_populated() to fence
the StoreLoad on (CSS_DYING, populated counter), guaranteeing that either
the walker queues kill_finish_work or the caller fires synchronously.
cgroup_destroy_locked() was implicitly fenced by an unrelated css_set_lock
pair; cgroup_apply_control_disable() in the next patch is not.
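
The paired-fence idea can be sketched with C11 atomics as a
store-buffering pattern. This is an assumption-laden model, not the kernel
code: struct toy_css, toy_kill_side() and toy_walker_side() are hypothetical,
and a seq_cst fence stands in for smp_mb(). It only shows the "at least one
side notices the other" property; anything beyond that is outside the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy css; not the kernel's struct cgroup_subsys_state. */
struct toy_css {
	atomic_bool dying;	/* stands in for CSS_DYING */
	atomic_int populated;	/* stands in for the populated counters */
};

/*
 * Killer side, modelled on kill_css_sync() plus the caller's check:
 * store "dying", full fence, then load the counter. Returns true when
 * the caller should finish the kill synchronously.
 */
static bool toy_kill_side(struct toy_css *css)
{
	atomic_store_explicit(&css->dying, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() analogue */
	return atomic_load_explicit(&css->populated, memory_order_relaxed) == 0;
}

/*
 * Walker side, modelled on css_update_populated(): drop the count, full
 * fence, then load "dying". Returns true when the walker should queue the
 * deferred kill.
 */
static bool toy_walker_side(struct toy_css *css)
{
	int old = atomic_fetch_sub_explicit(&css->populated, 1,
					    memory_order_relaxed);

	assert(old > 0);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() analogue */
	return old == 1 &&
	       atomic_load_explicit(&css->dying, memory_order_relaxed);
}
```

With a fence between each side's store and its load, the two sides cannot
both read the other's pre-store value, so the kill is never lost.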

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup-defs.h |  6 +--
 kernel/cgroup/cgroup.c      | 83 +++++++++++++++++++------------------
 2 files changed, 46 insertions(+), 43 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index c4929f7bbe5a..de2cd6238c2a 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -262,6 +262,9 @@ struct cgroup_subsys_state {
 	int nr_populated_csets;
 	int nr_populated_children;
 
+	/* deferred kill_css_finish() queued by css_update_populated() */
+	struct work_struct kill_finish_work;
+
 	/*
 	 * A singly-linked list of css structures to be rstat flushed.
 	 * This is a scratch field to be used exclusively by
@@ -615,9 +618,6 @@ struct cgroup {
 	/* used to wait for offlining of csses */
 	wait_queue_head_t offline_waitq;
 
-	/* defers killing csses after removal until cgroup is depopulated */
-	struct work_struct finish_destroy_work;
-
 	/* used to schedule release agent */
 	struct work_struct release_agent_work;
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index dd4ea9d83100..fa24102535d9 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -264,7 +264,6 @@ static void cgroup_finalize_control(struct cgroup *cgrp, int ret);
 static void css_task_iter_skip(struct css_task_iter *it,
 			       struct task_struct *task);
 static int cgroup_destroy_locked(struct cgroup *cgrp);
-static void cgroup_finish_destroy(struct cgroup *cgrp);
 static void kill_css_sync(struct cgroup_subsys_state *css);
 static void kill_css_finish(struct cgroup_subsys_state *css);
 static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
@@ -801,13 +800,19 @@ static void css_update_populated(struct cgroup_subsys_state *css, bool populated
 			break;
 
 		/*
-		 * Subtree just emptied below an offlined cgrp. Fire deferred
-		 * destroy. The transition is one-shot.
+		 * Pair with smp_mb() in kill_css_sync(). Either we observe
+		 * CSS_DYING and queue, or the caller observes our decrement
+		 * and fires synchronously.
 		 */
-		if (cgrp && was_populated && !css_is_online(css)) {
-			cgroup_get(cgrp);
-			WARN_ON_ONCE(!queue_work(cgroup_offline_wq,
-						 &cgrp->finish_destroy_work));
+		smp_mb();
+
+		/*
+		 * Subtree just emptied below a dying css. Fire deferred kill.
+		 * The transition is one-shot for a dying css.
+		 */
+		if (was_populated && css_is_dying(css)) {
+			css_get(css);
+			WARN_ON_ONCE(!queue_work(cgroup_offline_wq, &css->kill_finish_work));
 		}
 
 		if (cgrp) {
@@ -2064,16 +2069,6 @@ static int cgroup_reconfigure(struct fs_context *fc)
 	return 0;
 }
 
-static void cgroup_finish_destroy_work_fn(struct work_struct *work)
-{
-	struct cgroup *cgrp = container_of(work, struct cgroup, finish_destroy_work);
-
-	cgroup_lock();
-	cgroup_finish_destroy(cgrp);
-	cgroup_unlock();
-	cgroup_put(cgrp);
-}
-
 static void init_cgroup_housekeeping(struct cgroup *cgrp)
 {
 	struct cgroup_subsys *ss;
@@ -2100,7 +2095,6 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 #endif
 
 	init_waitqueue_head(&cgrp->offline_waitq);
-	INIT_WORK(&cgrp->finish_destroy_work, cgroup_finish_destroy_work_fn);
 	INIT_WORK(&cgrp->release_agent_work, cgroup1_release_agent);
 }
 
@@ -5695,6 +5689,22 @@ static void css_release(struct percpu_ref *ref)
 	queue_work(cgroup_release_wq, &css->destroy_work);
 }
 
+/*
+ * Deferred kill_css_finish() fired from css_update_populated() once a dying
+ * css's hierarchical populated state drops to zero. Pinned by css_get() at the
+ * queue site; matched by css_put() here.
+ */
+static void kill_css_finish_work_fn(struct work_struct *work)
+{
+	struct cgroup_subsys_state *css =
+		container_of(work, struct cgroup_subsys_state, kill_finish_work);
+
+	cgroup_lock();
+	kill_css_finish(css);
+	cgroup_unlock();
+	css_put(css);
+}
+
 static void init_and_link_css(struct cgroup_subsys_state *css,
 			      struct cgroup_subsys *ss, struct cgroup *cgrp)
 {
@@ -5708,6 +5718,7 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
 	css->id = -1;
 	INIT_LIST_HEAD(&css->sibling);
 	INIT_LIST_HEAD(&css->children);
+	INIT_WORK(&css->kill_finish_work, kill_css_finish_work_fn);
 	css->serial_nr = css_serial_nr_next++;
 	atomic_set(&css->online_cnt, 0);
 
@@ -6083,6 +6094,13 @@ static void kill_css_sync(struct cgroup_subsys_state *css)
 
 	css->flags |= CSS_DYING;
 
+	/*
+	 * Pair with smp_mb() in css_update_populated(). Either our
+	 * caller observes the walker's decrement and fires
+	 * synchronously, or the walker observes CSS_DYING and queues.
+	 */
+	smp_mb();
+
 	/*
 	 * This must happen before css is disassociated with its cgroup.
 	 * See seq_css() for details.
@@ -6158,9 +6176,9 @@ static void kill_css_finish(struct cgroup_subsys_state *css)
  * - This function: synchronous user-visible state teardown plus kill_css_sync()
  *   on each subsystem css.
  *
- * - cgroup_finish_destroy(): kicks the percpu_ref kill via kill_css_finish() on
- *   each subsystem css. Fires once @cgrp's subtree is fully drained, either
- *   inline here or from css_update_populated().
+ * - For each subsys css: fire kill_css_finish() synchronously if the subtree is
+ *   already drained, otherwise rely on css_update_populated() to queue
+ *   kill_finish_work when the last populated cset under the css empties.
  *
  * - The percpu_ref kill chain: css_killed_ref_fn -> css_killed_work_fn ->
  *   ->css_offline() -> release/free.
@@ -6238,29 +6256,14 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	/* put the base reference */
 	percpu_ref_kill(&cgrp->self.refcnt);
 
-	if (!cgroup_is_populated(cgrp))
-		cgroup_finish_destroy(cgrp);
+	for_each_css(css, ssid, cgrp) {
+		if (!css_is_populated(css))
+			kill_css_finish(css);
+	}
 
 	return 0;
 };
 
-/**
- * cgroup_finish_destroy - deferred half of @cgrp destruction
- * @cgrp: cgroup whose subtree just became empty
- *
- * See cgroup_destroy_locked() for the rationale.
- */
-static void cgroup_finish_destroy(struct cgroup *cgrp)
-{
-	struct cgroup_subsys_state *css;
-	int ssid;
-
-	lockdep_assert_held(&cgroup_mutex);
-
-	for_each_css(css, ssid, cgrp)
-		kill_css_finish(css);
-}
-
 int cgroup_rmdir(struct kernfs_node *kn)
 {
 	struct cgroup *cgrp;
-- 
2.54.0



* [PATCH 5/5] cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()
  2026-05-05  0:51 [PATCHSET cgroup/for-7.2] cgroup: Per-css kill_css_finish deferral Tejun Heo
                   ` (3 preceding siblings ...)
  2026-05-05  0:51 ` [PATCH 4/5] cgroup: Add per-subsys-css kill_css_finish deferral Tejun Heo
@ 2026-05-05  0:51 ` Tejun Heo
  2026-05-27 10:45   ` Mark Brown
  2026-05-13 21:01 ` [PATCHSET cgroup/for-7.2] cgroup: Per-css kill_css_finish deferral Tejun Heo
  2026-05-15 17:28 ` Tejun Heo
  6 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2026-05-05  0:51 UTC
  To: Johannes Weiner, Michal Koutný
  Cc: Sebastian Andrzej Siewior, Petr Malat, Bert Karwatzki,
	kernel test robot, Martin Pitt, cgroups, linux-kernel, Tejun Heo

Same race shape as the rmdir path that 93618edf7538 ("cgroup: Defer css
percpu_ref kill on rmdir until cgroup is depopulated") fixed: a task past
exit_signals() whose cset subsys[ssid] still pins the disabled controller's
css can be touching subsys state while ->css_offline() runs. The earlier
patches in this series built up the per-subsys-css deferral machinery and
routed cgroup_destroy_locked() through it. Apply the same shape to
cgroup_apply_control_disable():

	kill_css_sync(css);
	if (!css_is_populated(css))
		kill_css_finish(css);

When the dying css is still populated, kill_css_finish() is deferred. The
walker in css_update_populated() fires kill_finish_work once the css's
hierarchical populated count drops to zero.

cgroup_lock_and_drain_offline()'s wait predicate switches from
percpu_ref_is_dying() to css_is_dying(). CSS_DYING is set by kill_css_sync()
and is a strict superset of percpu_ref_is_dying. Without this change, a +cpu
re-enable after a deferred -cpu disable would skip the drain (percpu_ref
isn't killed yet) and observe the still-CSS_DYING css through cgroup_css(),
treating it as live.
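
Why the predicate has to change can be shown with a small userspace model.
Everything here is hypothetical (struct toy_css, the toy_* helpers, and the
two booleans standing in for CSS_DYING and the percpu_ref dying state); it
only models the window between the sync half and the deferred finish half.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy css tracking the two teardown stages separately. */
struct toy_css {
	bool css_dying;		/* set by the sync half (CSS_DYING) */
	bool ref_dying;		/* set by the finish half (percpu_ref killed) */
	int populated;
};

/* Disable path shaped like this patch: finish only if already drained. */
static void toy_apply_disable(struct toy_css *css)
{
	assert(css->populated >= 0);

	css->css_dying = true;			/* kill_css_sync() */
	if (!css->populated)
		css->ref_dying = true;		/* kill_css_finish() */
}

/* Old wait predicate: misses a css parked in the deferral window. */
static bool toy_old_must_wait(const struct toy_css *css)
{
	return css->ref_dying;
}

/* New wait predicate: covers the deferral window as well. */
static bool toy_new_must_wait(const struct toy_css *css)
{
	return css->css_dying;
}
```

For a css that is still populated when it is disabled, only the new predicate
reports that a drain is needed; for an already-empty css both agree.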

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup/cgroup.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index fa24102535d9..bdc8deedb4f7 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3237,7 +3237,7 @@ void cgroup_lock_and_drain_offline(struct cgroup *cgrp)
 			struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
 			DEFINE_WAIT(wait);
 
-			if (!css || !percpu_ref_is_dying(&css->refcnt))
+			if (!css || !css_is_dying(css))
 				continue;
 
 			cgroup_get_live(dsct);
@@ -3405,7 +3405,8 @@ static void cgroup_apply_control_disable(struct cgroup *cgrp)
 			if (css->parent &&
 			    !(cgroup_ss_mask(dsct) & (1 << ss->id))) {
 				kill_css_sync(css);
-				kill_css_finish(css);
+				if (!css_is_populated(css))
+					kill_css_finish(css);
 			} else if (!css_visible(css)) {
 				css_clear_dir(css);
 				if (ss->css_reset)
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/5] cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()
  2026-05-05  0:51 ` [PATCH 5/5] cgroup: Defer kill_css_finish() in cgroup_apply_control_disable() Tejun Heo
@ 2026-05-27 10:45   ` Mark Brown
  2026-05-29 17:25     ` Tejun Heo
  0 siblings, 1 reply; 19+ messages in thread
From: Mark Brown @ 2026-05-27 10:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Michal Koutný, Sebastian Andrzej Siewior,
	Petr Malat, Bert Karwatzki, kernel test robot, Martin Pitt,
	cgroups, linux-kernel, Aishwarya.TCV

[-- Attachment #1: Type: text/plain, Size: 16233 bytes --]

On Mon, May 04, 2026 at 02:51:21PM -1000, Tejun Heo wrote:

> Same race shape as the rmdir path that 93618edf7538 ("cgroup: Defer css
> percpu_ref kill on rmdir until cgroup is depopulated") fixed: a task past
> exit_signals() whose cset subsys[ssid] still pins the disabled controller's
> css can be touching subsys state while ->css_offline() runs. The earlier
> patches in this series built up the per-subsys-css deferral machinery and
> routed cgroup_destroy_locked() through it. Apply the same shape to
> cgroup_apply_control_disable():

We've been seeing hangs in our testing of -next on multiple arm64
platforms when running LTP test jobs, which bisect to this patch
(1dffd95575eb05bc7e in -next).  It looks like we hit a deadlock running
stress tests; the end of a typical log looks like this:

<12>[  181.849144] /opt/ltp/kirk[558]: cgroup_fj_stress_blkio_3_3_none: end (returncode: 0)
<12>[  181.860375] /opt/ltp/kirk[558]: cgroup_fj_stress_blkio_3_3_one: start (command: cgroup_fj_stress.sh blkio 3 3 one)
cgroup_fj_stress_blkio_3_3_one: pass  (1.166s)
<12>[  183.053379] /opt/ltp/kirk[558]: cgroup_fj_stress_blkio_3_3_one: end (returncode: 0)
<12>[  183.064884] /opt/ltp/kirk[558]: cgroup_fj_stress_blkio_4_4_each: start (command: cgroup_fj_stress.sh blkio 4 4 each)
cgroup_fj_stress_blkio_4_4_each: pass  (8.183s)
<12>[  191.275815] /opt/ltp/kirk[558]: cgroup_fj_stress_blkio_4_4_each: end (returncode: 0)
<12>[  191.287614] /opt/ltp/kirk[558]: cgroup_fj_stress_blkio_4_4_none: start (command: cgroup_fj_stress.sh blkio 4 4 none)
cgroup_fj_stress_blkio_4_4_none: pass  (3.570s)
<12>[  194.884173] /opt/ltp/kirk[558]: cgroup_fj_stress_blkio_4_4_none: end (returncode: 0)
<12>[  194.895255] /opt/ltp/kirk[558]: cgroup_fj_stress_cpu_1_200_each: start (command: cgroup_fj_stress.sh cpu 1 200 each)

with no further output. Given that this is a cgroup locking change,
this does seem like a plausible commit, though I didn't look into it in
detail.  The bisect log and the list of LTP tests we're running in our
test job are below.  We are running multiple tests in parallel.

bisect log:

git bisect start
# status: waiting for both good and bad commits
# bad: [d387b06f7c15b4639244ad66b4b0900c6a02b430] Add linux-next specific files for 20260525
git bisect bad d387b06f7c15b4639244ad66b4b0900c6a02b430
# status: waiting for good commit(s), bad commit known
# good: [c745c46074da99cdcef8c1fc6093030c6f9d7143] Merge branch 'for-linux-next-fixes' of https://gitlab.freedesktop.org/drm/misc/kernel.git
git bisect good c745c46074da99cdcef8c1fc6093030c6f9d7143
# good: [4c0ea14d8ec6f6fcb94b7ad9248679ffcf747e9b] Merge branch 'libcrypto-next' of https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git
git bisect good 4c0ea14d8ec6f6fcb94b7ad9248679ffcf747e9b
# good: [af3c2d822a4b67034aa47554c178b5ffcf973456] Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git
git bisect good af3c2d822a4b67034aa47554c178b5ffcf973456
# good: [2f8df83b202ec7f8080a25b2e9da79c3361775fd] Merge branch 'char-misc-next' of https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
git bisect good 2f8df83b202ec7f8080a25b2e9da79c3361775fd
# good: [1c70737e40e31fb0ba2d49ba06f8cac7a5e809a3] Merge branch 'staging-next' of https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging.git
git bisect good 1c70737e40e31fb0ba2d49ba06f8cac7a5e809a3
# bad: [4be0d6b749c63b04d4daebf43925149271943af4] Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl.git
git bisect bad 4be0d6b749c63b04d4daebf43925149271943af4
# bad: [1feee04d568fb4c3f56beaea3735da71d159b6f6] Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux.git
git bisect bad 1feee04d568fb4c3f56beaea3735da71d159b6f6
# bad: [adfcff24160cca9a20fd0bf8a1b8b5cacba6061d] Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
git bisect bad adfcff24160cca9a20fd0bf8a1b8b5cacba6061d
# good: [a234764e334b01b4b4a631b1c94df3458f3f57cb] Merge branch 'for-7.1-fixes' into for-next
git bisect good a234764e334b01b4b4a631b1c94df3458f3f57cb
# bad: [81807796db07bb1a4c066c75ccb5fcf04cbea3ed] Merge branch 'for-7.1-fixes' into for-next
git bisect bad 81807796db07bb1a4c066c75ccb5fcf04cbea3ed
# good: [c4799253a3ee74ebb27be72fb991c597a5902c01] cgroup: Move populated counters to cgroup_subsys_state
git bisect good c4799253a3ee74ebb27be72fb991c597a5902c01
# bad: [6c79fb30f5cd939f22959bd9b54d7f30c713a759] Merge branch 'for-7.2' into for-next
git bisect bad 6c79fb30f5cd939f22959bd9b54d7f30c713a759
# bad: [1dffd95575eb05bc7ec20ec096ce73be4c5d1ed5] cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()
git bisect bad 1dffd95575eb05bc7ec20ec096ce73be4c5d1ed5
# good: [cfc1da7e1127b4c8787f4dc25d59987c10c9107f] cgroup: Add per-subsys-css kill_css_finish deferral
git bisect good cfc1da7e1127b4c8787f4dc25d59987c10c9107f
# first bad commit: [1dffd95575eb05bc7ec20ec096ce73be4c5d1ed5] cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()

test list:

abort01 abort01
abs01 abs01
accept01 accept01
accept4_01 accept4_01
access01 access01
access02 access02
access03 access03
access04 access04
acct01 acct01
acct02 acct02
add_key01 add_key01
add_key02 add_key02
add_key03 add_key03
add_key04 add_key04
add_key05 add_key05
adjtimex01 adjtimex01
adjtimex02 adjtimex02
adjtimex03 adjtimex03
alarm02 alarm02
alarm03 alarm03
alarm05 alarm05
alarm06 alarm06
alarm07 alarm07
ar_sh export TCdat=$LTPROOT/testcases/bin; ar01.sh
asapi_01 asapi_01
asapi_02 asapi_02
asapi_03 asapi_03
atof01 atof01
autogroup01 autogroup01
bind01 bind01
bind02 bind02
bind03 bind03
bind04 bind04
bind05 bind05
binfmt_misc01 binfmt_misc01.sh
binfmt_misc02 binfmt_misc02.sh
brk01 brk01
capget01 capget01
capget02 capget02
capset01 capset01
capset02 capset02
capset03 capset03
capset04 capset04
cgroup_fj_function_blkio cgroup_fj_function.sh blkio
cgroup_fj_function_cpu cgroup_fj_function.sh cpu
cgroup_fj_function_cpuacct cgroup_fj_function.sh cpuacct
cgroup_fj_function_cpuset cgroup_fj_function.sh cpuset
cgroup_fj_function_devices cgroup_fj_function.sh devices
cgroup_fj_function_hugetlb cgroup_fj_function.sh hugetlb
cgroup_fj_function_memory cgroup_fj_function.sh memory
cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
cgroup_fj_stress_blkio_1_200_each cgroup_fj_stress.sh blkio 1 200 each
cgroup_fj_stress_blkio_1_200_none cgroup_fj_stress.sh blkio 1 200 none
cgroup_fj_stress_blkio_1_200_one cgroup_fj_stress.sh blkio 1 200 one
cgroup_fj_stress_blkio_200_1_each cgroup_fj_stress.sh blkio 200 1 each
cgroup_fj_stress_blkio_200_1_none cgroup_fj_stress.sh blkio 200 1 none
cgroup_fj_stress_blkio_2_2_each cgroup_fj_stress.sh blkio 2 2 each
cgroup_fj_stress_blkio_2_2_none cgroup_fj_stress.sh blkio 2 2 none
cgroup_fj_stress_blkio_2_2_one cgroup_fj_stress.sh blkio 2 2 one
cgroup_fj_stress_blkio_2_9_none cgroup_fj_stress.sh blkio 2 9 none
cgroup_fj_stress_blkio_3_3_each cgroup_fj_stress.sh blkio 3 3 each
cgroup_fj_stress_blkio_3_3_none cgroup_fj_stress.sh blkio 3 3 none
cgroup_fj_stress_blkio_3_3_one cgroup_fj_stress.sh blkio 3 3 one
cgroup_fj_stress_blkio_4_4_each cgroup_fj_stress.sh blkio 4 4 each
cgroup_fj_stress_blkio_4_4_none cgroup_fj_stress.sh blkio 4 4 none
cgroup_fj_stress_cpu_1_200_each cgroup_fj_stress.sh cpu 1 200 each
cgroup_fj_stress_cpu_1_200_none cgroup_fj_stress.sh cpu 1 200 none
cgroup_fj_stress_cpu_1_200_one cgroup_fj_stress.sh cpu 1 200 one
cgroup_fj_stress_cpu_200_1_each cgroup_fj_stress.sh cpu 200 1 each
cgroup_fj_stress_cpu_200_1_none cgroup_fj_stress.sh cpu 200 1 none
cgroup_fj_stress_cpu_2_2_each cgroup_fj_stress.sh cpu 2 2 each
cgroup_fj_stress_cpu_2_2_none cgroup_fj_stress.sh cpu 2 2 none
cgroup_fj_stress_cpu_2_2_one cgroup_fj_stress.sh cpu 2 2 one
cgroup_fj_stress_cpu_2_9_none cgroup_fj_stress.sh cpu 2 9 none
cgroup_fj_stress_cpu_3_3_each cgroup_fj_stress.sh cpu 3 3 each
cgroup_fj_stress_cpu_3_3_none cgroup_fj_stress.sh cpu 3 3 none
cgroup_fj_stress_cpu_3_3_one cgroup_fj_stress.sh cpu 3 3 one
cgroup_fj_stress_cpu_4_4_each cgroup_fj_stress.sh cpu 4 4 each
cgroup_fj_stress_cpu_4_4_none cgroup_fj_stress.sh cpu 4 4 none
cgroup_fj_stress_cpuacct_10_3_none cgroup_fj_stress.sh cpuacct 10 3 none
cgroup_fj_stress_cpuacct_1_200_each cgroup_fj_stress.sh cpuacct 1 200 each
cgroup_fj_stress_cpuacct_1_200_none cgroup_fj_stress.sh cpuacct 1 200 none
cgroup_fj_stress_cpuacct_1_200_one cgroup_fj_stress.sh cpuacct 1 200 one
cgroup_fj_stress_cpuacct_200_1_none cgroup_fj_stress.sh cpuacct 200 1 none
cgroup_fj_stress_cpuacct_200_1_one cgroup_fj_stress.sh cpuacct 200 1 one
cgroup_fj_stress_cpuacct_2_2_each cgroup_fj_stress.sh cpuacct 2 2 each
cgroup_fj_stress_cpuacct_2_2_none cgroup_fj_stress.sh cpuacct 2 2 none
cgroup_fj_stress_cpuacct_2_2_one cgroup_fj_stress.sh cpuacct 2 2 one
cgroup_fj_stress_cpuacct_2_9_none cgroup_fj_stress.sh cpuacct 2 9 none
cgroup_fj_stress_cpuacct_3_3_each cgroup_fj_stress.sh cpuacct 3 3 each
cgroup_fj_stress_cpuacct_3_3_none cgroup_fj_stress.sh cpuacct 3 3 none
cgroup_fj_stress_cpuacct_3_3_one cgroup_fj_stress.sh cpuacct 3 3 one
cgroup_fj_stress_cpuacct_4_4_each cgroup_fj_stress.sh cpuacct 4 4 each
cgroup_fj_stress_cpuacct_4_4_none cgroup_fj_stress.sh cpuacct 4 4 none
cgroup_fj_stress_cpuset_1_200_each cgroup_fj_stress.sh cpuset 1 200 each
cgroup_fj_stress_cpuset_1_200_none cgroup_fj_stress.sh cpuset 1 200 none
cgroup_fj_stress_cpuset_1_200_one cgroup_fj_stress.sh cpuset 1 200 one
cgroup_fj_stress_cpuset_200_1_none cgroup_fj_stress.sh cpuset 200 1 none
cgroup_fj_stress_cpuset_2_2_each cgroup_fj_stress.sh cpuset 2 2 each
cgroup_fj_stress_cpuset_2_2_none cgroup_fj_stress.sh cpuset 2 2 none
cgroup_fj_stress_cpuset_2_2_one cgroup_fj_stress.sh cpuset 2 2 one
cgroup_fj_stress_cpuset_3_3_each cgroup_fj_stress.sh cpuset 3 3 each
cgroup_fj_stress_cpuset_3_3_none cgroup_fj_stress.sh cpuset 3 3 none
cgroup_fj_stress_cpuset_3_3_one cgroup_fj_stress.sh cpuset 3 3 one
cgroup_fj_stress_cpuset_4_4_none cgroup_fj_stress.sh cpuset 4 4 none
cgroup_fj_stress_devices_10_3_none cgroup_fj_stress.sh devices 10 3 none
cgroup_fj_stress_devices_1_200_each cgroup_fj_stress.sh devices 1 200 each
cgroup_fj_stress_devices_1_200_none cgroup_fj_stress.sh devices 1 200 none
cgroup_fj_stress_devices_1_200_one cgroup_fj_stress.sh devices 1 200 one
cgroup_fj_stress_devices_200_1_each cgroup_fj_stress.sh devices 200 1 each
cgroup_fj_stress_devices_200_1_none cgroup_fj_stress.sh devices 200 1 none
cgroup_fj_stress_devices_2_2_each cgroup_fj_stress.sh devices 2 2 each
cgroup_fj_stress_devices_2_2_none cgroup_fj_stress.sh devices 2 2 none
cgroup_fj_stress_devices_2_2_one cgroup_fj_stress.sh devices 2 2 one
cgroup_fj_stress_devices_2_9_none cgroup_fj_stress.sh devices 2 9 none
cgroup_fj_stress_devices_3_3_each cgroup_fj_stress.sh devices 3 3 each
cgroup_fj_stress_devices_3_3_none cgroup_fj_stress.sh devices 3 3 none
cgroup_fj_stress_devices_3_3_one cgroup_fj_stress.sh devices 3 3 one
cgroup_fj_stress_devices_4_4_each cgroup_fj_stress.sh devices 4 4 each
cgroup_fj_stress_devices_4_4_none cgroup_fj_stress.sh devices 4 4 none
cgroup_fj_stress_hugetlb_1_200_each cgroup_fj_stress.sh hugetlb 1 200 each
cgroup_fj_stress_hugetlb_1_200_none cgroup_fj_stress.sh hugetlb 1 200 none
cgroup_fj_stress_hugetlb_1_200_one cgroup_fj_stress.sh hugetlb 1 200 one
cgroup_fj_stress_hugetlb_200_1_each cgroup_fj_stress.sh hugetlb 200 1 each
cgroup_fj_stress_hugetlb_200_1_none cgroup_fj_stress.sh hugetlb 200 1 none
cgroup_fj_stress_hugetlb_2_2_each cgroup_fj_stress.sh hugetlb 2 2 each
cgroup_fj_stress_hugetlb_2_2_none cgroup_fj_stress.sh hugetlb 2 2 none
cgroup_fj_stress_hugetlb_2_2_one cgroup_fj_stress.sh hugetlb 2 2 one
cgroup_fj_stress_hugetlb_2_9_none cgroup_fj_stress.sh hugetlb 2 9 none
cgroup_fj_stress_hugetlb_3_3_each cgroup_fj_stress.sh hugetlb 3 3 each
cgroup_fj_stress_hugetlb_3_3_none cgroup_fj_stress.sh hugetlb 3 3 none
cgroup_fj_stress_hugetlb_3_3_one cgroup_fj_stress.sh hugetlb 3 3 one
cgroup_fj_stress_hugetlb_4_4_each cgroup_fj_stress.sh hugetlb 4 4 each
cgroup_fj_stress_hugetlb_4_4_none cgroup_fj_stress.sh hugetlb 4 4 none
cgroup_fj_stress_memory_10_3_none cgroup_fj_stress.sh memory 10 3 none
cgroup_fj_stress_memory_1_200_each cgroup_fj_stress.sh memory 1 200 each
cgroup_fj_stress_memory_1_200_none cgroup_fj_stress.sh memory 1 200 none
cgroup_fj_stress_memory_1_200_one cgroup_fj_stress.sh memory 1 200 one
cgroup_fj_stress_memory_200_1_each cgroup_fj_stress.sh memory 200 1 each
cgroup_fj_stress_memory_200_1_none cgroup_fj_stress.sh memory 200 1 none
cgroup_fj_stress_memory_2_2_each cgroup_fj_stress.sh memory 2 2 each
cgroup_fj_stress_memory_2_2_none cgroup_fj_stress.sh memory 2 2 none
cgroup_fj_stress_memory_2_2_one cgroup_fj_stress.sh memory 2 2 one
cgroup_fj_stress_memory_2_9_none cgroup_fj_stress.sh memory 2 9 none
cgroup_fj_stress_memory_3_3_each cgroup_fj_stress.sh memory 3 3 each
cgroup_fj_stress_memory_3_3_none cgroup_fj_stress.sh memory 3 3 none
cgroup_fj_stress_memory_3_3_one cgroup_fj_stress.sh memory 3 3 one
cgroup_fj_stress_memory_4_4_each cgroup_fj_stress.sh memory 4 4 each
cgroup_fj_stress_memory_4_4_none cgroup_fj_stress.sh memory 4 4 none
cgroup_fj_stress_perf_event_1_200_each cgroup_fj_stress.sh perf_event 1 200 each
cgroup_fj_stress_perf_event_1_200_none cgroup_fj_stress.sh perf_event 1 200 none
cgroup_fj_stress_perf_event_1_200_one cgroup_fj_stress.sh perf_event 1 200 one
cgroup_fj_stress_perf_event_200_1_none cgroup_fj_stress.sh perf_event 200 1 none
cgroup_fj_stress_perf_event_200_1_one cgroup_fj_stress.sh perf_event 200 1 one
cgroup_fj_stress_perf_event_2_2_each cgroup_fj_stress.sh perf_event 2 2 each
cgroup_fj_stress_perf_event_2_2_none cgroup_fj_stress.sh perf_event 2 2 none
cgroup_fj_stress_perf_event_2_2_one cgroup_fj_stress.sh perf_event 2 2 one
cgroup_fj_stress_perf_event_2_9_none cgroup_fj_stress.sh perf_event 2 9 none
cgroup_fj_stress_perf_event_3_3_each cgroup_fj_stress.sh perf_event 3 3 each
cgroup_fj_stress_perf_event_3_3_none cgroup_fj_stress.sh perf_event 3 3 none
cgroup_fj_stress_perf_event_3_3_one cgroup_fj_stress.sh perf_event 3 3 one
cgroup_fj_stress_perf_event_4_4_each cgroup_fj_stress.sh perf_event 4 4 each
cgroup_fj_stress_perf_event_4_4_none cgroup_fj_stress.sh perf_event 4 4 none
cgroup_xattr cgroup_xattr
chdir01 chdir01
chdir04 chdir04
chmod01 chmod01
chmod03 chmod03
chmod05 chmod05
chmod06 chmod06
chmod07 chmod07
chown01 chown01
chown02 chown02
chown03 chown03
chown04 chown04
chown05 chown05
chroot01 chroot01
chroot02 chroot02
chroot03 chroot03
chroot04 chroot04
clock_adjtime01 clock_adjtime01
clock_adjtime02 clock_adjtime02
clock_getres01 clock_getres01
clock_gettime01 clock_gettime01
clock_gettime02 clock_gettime02
clock_nanosleep01 clock_nanosleep01
clock_nanosleep02 clock_nanosleep02
clock_nanosleep03 clock_nanosleep03
clock_nanosleep03 clock_nanosleep03
clock_nanosleep04 clock_nanosleep04
clock_settime01 clock_settime01
clock_settime02 clock_settime02
clock_settime03 clock_settime03
clone01 clone01
clone02 clone02
clone03 clone03
clone04 clone04
clone05 clone05
clone06 clone06
clone07 clone07
clone08 clone08
clone09 clone09
clone301 clone301
clone302 clone302
close01 close01
close02 close02
confstr01 confstr01
connect01 connect01
connect02 connect02
copy_file_range01 copy_file_range01
copy_file_range02 copy_file_range02
copy_file_range03 copy_file_range03
cp01_sh cp_tests.sh
cpio01_sh cpio_tests.sh
cpuacct_100_1 cpuacct.sh 100 1
cpuacct_10_10 cpuacct.sh 10 10
cpuacct_1_1 cpuacct.sh 1 1
cpuacct_1_10 cpuacct.sh 1 10
cpuacct_1_100 cpuacct.sh 1 100
cpuhotplug02 cpuhotplug02.sh -c 1 -l 1
cpuhotplug03 cpuhotplug03.sh -c 1 -l 1
cpuhotplug04 cpuhotplug04.sh -l 1
cpuhotplug06 cpuhotplug06.sh -c 1 -l 1
cpuset_hotplug cpuset_hotplug_test.sh
cpuset_inherit cpuset_inherit_testset.sh
cpuset_regression_test cpuset_regression_test.sh
creat01 creat01
creat03 creat03
creat04 creat04
creat05 creat05
creat06 creat06
creat08 creat08
cve-2011-2183 ksm05 -I 10
cve-2012-0957 uname04
cve-2014-0196 cve-2014-0196
cve-2015-0235 gethostbyname_r01
cve-2015-7550 keyctl02
cve-2016-10044 cve-2016-10044
cve-2016-4997 setsockopt03
cve-2016-5195 dirtyc0w
cve-2016-7042 cve-2016-7042
cve-2016-7117 cve-2016-7117
cve-2016-8655 setsockopt06

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/5] cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()
  2026-05-27 10:45   ` Mark Brown
@ 2026-05-29 17:25     ` Tejun Heo
  2026-05-29 21:08       ` Mark Brown
  0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2026-05-29 17:25 UTC (permalink / raw)
  To: Mark Brown
  Cc: Johannes Weiner, Michal Koutný, Sebastian Andrzej Siewior,
	Petr Malat, Bert Karwatzki, kernel test robot, Martin Pitt,
	cgroups, linux-kernel, Aishwarya.TCV

Hello, Mark.

On Wed, May 27, 2026 at 11:45:54AM +0100, Mark Brown wrote:
> On Mon, May 04, 2026 at 02:51:21PM -1000, Tejun Heo wrote:
> 
> > Same race shape as the rmdir path that 93618edf7538 ("cgroup: Defer css
> > percpu_ref kill on rmdir until cgroup is depopulated") fixed: a task past
> > exit_signals() whose cset subsys[ssid] still pins the disabled controller's
> > css can be touching subsys state while ->css_offline() runs. The earlier
> > patches in this series built up the per-subsys-css deferral machinery and
> > routed cgroup_destroy_locked() through it. Apply the same shape to
> > cgroup_apply_control_disable():
> 
> > We've been seeing hangs in our testing of -next on multiple arm64
> > platforms when running LTP test jobs, which bisect to this patch
> > (1dffd95575eb05bc7e in -next).  It looks like we hit a deadlock running
> > stress tests; the end of a typical log looks like this:
> 
> <12>[  181.849144] /opt/ltp/kirk[558]: cgroup_fj_stress_blkio_3_3_none: end (returncode: 0)
> <12>[  181.860375] /opt/ltp/kirk[558]: cgroup_fj_stress_blkio_3_3_one: start (command: cgroup_fj_stress.sh blkio 3 3 one)
> cgroup_fj_stress_blkio_3_3_one: pass  (1.166s)
> <12>[  183.053379] /opt/ltp/kirk[558]: cgroup_fj_stress_blkio_3_3_one: end (returncode: 0)
> <12>[  183.064884] /opt/ltp/kirk[558]: cgroup_fj_stress_blkio_4_4_each: start (command: cgroup_fj_stress.sh blkio 4 4 each)
> cgroup_fj_stress_blkio_4_4_each: pass  (8.183s)
> <12>[  191.275815] /opt/ltp/kirk[558]: cgroup_fj_stress_blkio_4_4_each: end (returncode: 0)
> <12>[  191.287614] /opt/ltp/kirk[558]: cgroup_fj_stress_blkio_4_4_none: start (command: cgroup_fj_stress.sh blkio 4 4 none)
> cgroup_fj_stress_blkio_4_4_none: pass  (3.570s)
> <12>[  194.884173] /opt/ltp/kirk[558]: cgroup_fj_stress_blkio_4_4_none: end (returncode: 0)
> <12>[  194.895255] /opt/ltp/kirk[558]: cgroup_fj_stress_cpu_1_200_each: start (command: cgroup_fj_stress.sh cpu 1 200 each)
> 
> > with no further output. Given that this is a cgroup locking change,
> > this does seem like a plausible commit, though I didn't look into it in
> > detail.  The bisect log and the list of LTP tests we're running in our
> > test job are below.  We are running multiple tests in parallel.

Unfortunately, I can't reproduce this in my environment. Any chance you can
try testing on x86 too and see whether it reproduces there?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/5] cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()
  2026-05-29 17:25     ` Tejun Heo
@ 2026-05-29 21:08       ` Mark Brown
  2026-05-31  9:19         ` Bert Karwatzki
  0 siblings, 1 reply; 19+ messages in thread
From: Mark Brown @ 2026-05-29 21:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Michal Koutný, Sebastian Andrzej Siewior,
	Petr Malat, Bert Karwatzki, kernel test robot, Martin Pitt,
	cgroups, linux-kernel, Aishwarya.TCV

[-- Attachment #1: Type: text/plain, Size: 736 bytes --]

On Fri, May 29, 2026 at 07:25:29AM -1000, Tejun Heo wrote:
> On Wed, May 27, 2026 at 11:45:54AM +0100, Mark Brown wrote:
> > On Mon, May 04, 2026 at 02:51:21PM -1000, Tejun Heo wrote:

> > with no further output. Given that this is a cgroup locking change,
> > this does seem like a plausible commit, though I didn't look into it in
> > detail.  The bisect log and the list of LTP tests we're running in our
> > test job are below.  We are running multiple tests in parallel.

> Unfortunately, I can't reproduce this in my environment. Any chance you can
> try testing on x86 too and see whether it reproduces there?

Not readily sadly, I'll see if I can figure something out.  Our rootfs
images are based on Debian Trixie if that's relevant?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/5] cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()
  2026-05-29 21:08       ` Mark Brown
@ 2026-05-31  9:19         ` Bert Karwatzki
  2026-05-31 18:45           ` Bert Karwatzki
  0 siblings, 1 reply; 19+ messages in thread
From: Bert Karwatzki @ 2026-05-31  9:19 UTC (permalink / raw)
  To: Mark Brown, Tejun Heo
  Cc: Johannes Weiner, spasswolf@web.de, Michal Koutný,
	Sebastian Andrzej Siewior, Petr Malat, kernel test robot,
	Martin Pitt, cgroups, linux-kernel, Aishwarya.TCV

On Friday, 29.05.2026 at 22:08 +0100, Mark Brown wrote:
> On Fri, May 29, 2026 at 07:25:29AM -1000, Tejun Heo wrote:
> > On Wed, May 27, 2026 at 11:45:54AM +0100, Mark Brown wrote:
> > > On Mon, May 04, 2026 at 02:51:21PM -1000, Tejun Heo wrote:
> 
> > > with no further output. Given that this is a cgroup locking change,
> > > this does seem like a plausible commit, though I didn't look into it in
> > > detail.  The bisect log and the list of LTP tests we're running in our
> > > test job are below.  We are running multiple tests in parallel.
> 
> > Unfortunately, I can't reproduce this in my environment. Any chance you can
> > try testing on x86 too and see whether it reproduces there?
> 
> Not readily sadly, I'll see if I can figure something out.  Our rootfs
> images are based on Debian Trixie if that's relevant?

Using Debian unstable (sid/forky), I can at least detect a timeout when
running the LTP controllers test suite:

# LTPROOT=/home/bert/ltp-install/ ./kirk --run-suite controllers
Host information
 Hostname: homer
 Python: 3.13.12 (main, Feb 4 2026, 15:06:39) [GCC 15.2.0]
 Directory: /tmp/kirk.root/tmp092in2yb

Connecting to SUT: default

Suite: controllers
──────────────────
cgroup_core01: pass  (0.024s)
cgroup_core02: pass  (0.004s)
cgroup_core03: pass  (0.017s)
cgroup: skip  (2m 41s)
memcg_regression: skip  (3.414s)
memcg_test_3: pass  (0.090s)
memcg_failcnt: skip  (0.019s)
memcg_force_empty: skip  (0.015s)
memcg_limit_in_bytes: skip  (0.017s)
memcg_stat_rss: skip  (0.015s)
memcg_subgroup_charge: skip  (0.015s)
memcg_max_usage_in_bytes: skip  (0.014s)
memcg_move_charge_at_immigrate: skip  (0.014s)
memcg_memsw_limit_in_bytes: skip  (0.015s)
memcg_stat: skip  (0.015s)
memcg_use_hierarchy: skip  (0.015s)
memcg_usage_in_bytes: skip  (0.014s)
memcg_stress: pass  (30m 4s)
memcg_control: pass  (6.058s)
memcontrol01: pass  (0.004s)
memcontrol02: pass  (0.636s)
memcontrol03: pass  (15.983s)
memcontrol04: pass  (0.890s)
cgroup_fj_function_debug: skip  (0.013s)
cgroup_fj_function_cpuset: skip  (0.044s)
cgroup_fj_function_cpu: skip  (0.050s)
cgroup_fj_function_cpuacct: pass  (0.052s)
cgroup_fj_function_memory: skip  (0.042s)
cgroup_fj_function_freezer: pass  (0.044s)
cgroup_fj_function_devices: pass  (0.066s)
cgroup_fj_function_blkio: skip  (0.009s)
cgroup_fj_function_net_cls: pass  (0.073s)
cgroup_fj_function_perf_event: pass  (0.072s)
cgroup_fj_function_net_prio: Suite 'controllers' timed out after 3600 seconds

Execution time: 1h 33m 13s

Disconnecting from SUT: default

Target information
──────────────────
Kernel:   Linux 7.1.0-rc5-next-20260528-master-dirty #480 SMP PREEMPT_RT Thu May 28 19:55:12 CEST 2026
Cmdline:  BOOT_IMAGE=/boot/vmlinuz-7.1.0-rc5-next-20260528-master-dirty
          root=UUID=3d5cdc5d-1902-40bf-9e16-ca819372d350
          ro
          quiet
Machine:  unknown
Arch:     x86_64
RAM:      63439380 kB
Swap:     78125052 kB
Distro:   debian 

────────────────────────
      TEST SUMMARY
────────────────────────
Suite:   controllers
Runtime: 33m 13s
Runs:    347

Results:
    Passed:   181
    Failed:   0
    Broken:   0
    Skipped:  350
    Warnings: 0

Session stopped

In dmesg I get messages about task tst_cgctl hanging:

[ 2212.794669] [    T346] INFO: task tst_cgctl:317896 blocked for more than 122 seconds.
[ 2212.794674] [    T346]       Not tainted 7.1.0-rc5-next-20260528-master-dirty #480
[ 2212.794675] [    T346] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[...] 

[ 3318.721344] [    T346] INFO: task tst_cgctl:317896 blocked for more than 1228 seconds.
[ 3318.721349] [    T346]       Not tainted 7.1.0-rc5-next-20260528-master-dirty #480
[ 3318.721351] [    T346] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.






On 6.19.14 the result of this test run is:

# LTPROOT=/home/bert/ltp-install/ ./kirk --run-suite controllers

[...]

Target information
──────────────────
Kernel:   Linux 6.19.14-stable #1238 SMP PREEMPT_RT Sat May 30 17:28:29 CEST 2026
Cmdline:  BOOT_IMAGE=/boot/vmlinuz-6.19.14-stable
          root=UUID=3d5cdc5d-1902-40bf-9e16-ca819372d350
          ro
          quiet
Machine:  unknown
Arch:     x86_64
RAM:      63436188 kB
Swap:     78125052 kB
Distro:   debian 

────────────────────────
      TEST SUMMARY
────────────────────────
Suite:   controllers
Runtime: 36m 12s
Runs:    347

Results:
    Passed:   1742
    Failed:   0
    Broken:   0
    Skipped:  97
    Warnings: 0

Session stopped

With 6.19.14 I also get no hung tasks.

On 7.0.10 the tests also work:

root@homer:/mnt/data/linux-forest/kirk# LTPROOT=/home/bert/ltp-install/ ./kirk --run-suite controllers
Host information
	Hostname:   homer
	Python:     3.13.12 (main, Feb  4 2026, 15:06:39) [GCC 15.2.0]
	Directory:  /tmp/kirk.root/tmpq32b09g7

Connecting to SUT: default

Suite: controllers
──────────────────
cgroup_core01: pass  (0.016s)

[...]

pids_9_100: pass  (0.107s)

Execution time: 36m 15s

Disconnecting from SUT: default

Target information
──────────────────
Kernel:   Linux 7.0.10-stable #1239 SMP PREEMPT_RT Sun May 31 00:42:41 CEST 2026
Cmdline:  BOOT_IMAGE=/boot/vmlinuz-7.0.10-stable
          root=UUID=3d5cdc5d-1902-40bf-9e16-ca819372d350
          ro
          quiet
Machine:  unknown
Arch:     x86_64
RAM:      63435940 kB
Swap:     78125052 kB
Distro:   debian 

────────────────────────
      TEST SUMMARY
────────────────────────
Suite:   controllers
Runtime: 36m 13s
Runs:    347

Results:
    Passed:   1742
    Failed:   0
    Broken:   0
    Skipped:  97
    Warnings: 0

Session stopped



I'm not sure if this is related to the problems on arm64, but I'll try bisecting this.

Bert Karwatzki

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/5] cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()
  2026-05-31  9:19         ` Bert Karwatzki
@ 2026-05-31 18:45           ` Bert Karwatzki
  2026-06-01  9:22             ` Bert Karwatzki
  0 siblings, 1 reply; 19+ messages in thread
From: Bert Karwatzki @ 2026-05-31 18:45 UTC (permalink / raw)
  To: Mark Brown, Tejun Heo
  Cc: Johannes Weiner, spasswolf, Michal Koutný,
	Sebastian Andrzej Siewior, Petr Malat, kernel test robot,
	Martin Pitt, cgroups, linux-kernel, Aishwarya.TCV

On Sunday, 31.05.2026 at 11:19 +0200, Bert Karwatzki wrote:
> On Friday, 29.05.2026 at 22:08 +0100, Mark Brown wrote:
> > On Fri, May 29, 2026 at 07:25:29AM -1000, Tejun Heo wrote:
> > > On Wed, May 27, 2026 at 11:45:54AM +0100, Mark Brown wrote:
> > > > On Mon, May 04, 2026 at 02:51:21PM -1000, Tejun Heo wrote:
> > 
> > > > with no further output. Given that this is a cgroup locking change,
> > > > this does seem like a plausible commit, though I didn't look into it in
> > > > detail.  The bisect log and the list of LTP tests we're running in our
> > > > test job are below.  We are running multiple tests in parallel.
> > 
> > > Unfortunately, I can't reproduce this in my environment. Any chance you can
> > > try testing on x86 too and see whether it reproduces there?
> > 
> > Not readily sadly, I'll see if I can figure something out.  Our rootfs
> > images are based on Debian Trixie if that's relevant?
> 
> Using Debian unstable (sid/forky), I can at least detect a timeout when
> running the LTP controllers test suite:
> 
> # LTPROOT=/home/bert/ltp-install/ ./kirk --run-suite controllers
> Host information
>  Hostname: homer
>  Python: 3.13.12 (main, Feb 4 2026, 15:06:39) [GCC 15.2.0]
>  Directory: /tmp/kirk.root/tmp092in2yb
> 
> Connecting to SUT: default
> 
> Suite: controllers
> ──────────────────
> cgroup_core01: pass  (0.024s)
> cgroup_core02: pass  (0.004s)
> cgroup_core03: pass  (0.017s)
> cgroup: skip  (2m 41s)
> memcg_regression: skip  (3.414s)
> memcg_test_3: pass  (0.090s)
> memcg_failcnt: skip  (0.019s)
> memcg_force_empty: skip  (0.015s)
> memcg_limit_in_bytes: skip  (0.017s)
> memcg_stat_rss: skip  (0.015s)
> memcg_subgroup_charge: skip  (0.015s)
> memcg_max_usage_in_bytes: skip  (0.014s)
> memcg_move_charge_at_immigrate: skip  (0.014s)
> memcg_memsw_limit_in_bytes: skip  (0.015s)
> memcg_stat: skip  (0.015s)
> memcg_use_hierarchy: skip  (0.015s)
> memcg_usage_in_bytes: skip  (0.014s)
> memcg_stress: pass  (30m 4s)
> memcg_control: pass  (6.058s)
> memcontrol01: pass  (0.004s)
> memcontrol02: pass  (0.636s)
> memcontrol03: pass  (15.983s)
> memcontrol04: pass  (0.890s)
> cgroup_fj_function_debug: skip  (0.013s)
> cgroup_fj_function_cpuset: skip  (0.044s)
> cgroup_fj_function_cpu: skip  (0.050s)
> cgroup_fj_function_cpuacct: pass  (0.052s)
> cgroup_fj_function_memory: skip  (0.042s)
> cgroup_fj_function_freezer: pass  (0.044s)
> cgroup_fj_function_devices: pass  (0.066s)
> cgroup_fj_function_blkio: skip  (0.009s)
> cgroup_fj_function_net_cls: pass  (0.073s)
> cgroup_fj_function_perf_event: pass  (0.072s)
> 
> 
> Execution time: 1h 33m 13s
> 
> Disconnecting from SUT: default
> 
> Target information
> ──────────────────
> Kernel:   Linux 7.1.0-rc5-next-20260528-master-dirty #480 SMP PREEMPT_RT Thu May 28 19:55:12 CEST 2026
> Cmdline:  BOOT_IMAGE=/boot/vmlinuz-7.1.0-rc5-next-20260528-master-dirty
>           root=UUID=3d5cdc5d-1902-40bf-9e16-ca819372d350
>           ro
>           quiet
> Machine:  unknown
> Arch:     x86_64
> RAM:      63439380 kB
> Swap:     78125052 kB
> Distro:   debian 
> 
> ────────────────────────
>       TEST SUMMARY
> ────────────────────────
> Suite:   controllers
> Runtime: 33m 13s
> Runs:    347
> 
> Results:
>     Passed:   181
>     Failed:   0
>     Broken:   0
>     Skipped:  350
>     Warnings: 0
> 
> Session stopped
> 
> In dmesg I get messages about task tst_cgctl hanging:
> 
> [ 2212.794669] [    T346] INFO: task tst_cgctl:317896 blocked for more than 122 seconds.
> [ 2212.794674] [    T346]       Not tainted 7.1.0-rc5-next-20260528-master-dirty #480
> [ 2212.794675] [    T346] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> 
> [...] 
> 
> [ 3318.721344] [    T346] INFO: task tst_cgctl:317896 blocked for more than 1228 seconds.
> [ 3318.721349] [    T346]       Not tainted 7.1.0-rc5-next-20260528-master-dirty #480
> [ 3318.721351] [    T346] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> 
> 
> 
> 
> 
> 
> On 6.19.14 the results of this test run are:
> 
> # LTPROOT=/home/bert/ltp-install/ ./kirk --run-suite controllers
> 
> [...]
> 
> Target information
> ──────────────────
> Kernel:   Linux 6.19.14-stable #1238 SMP PREEMPT_RT Sat May 30 17:28:29 CEST 2026
> Cmdline:  BOOT_IMAGE=/boot/vmlinuz-6.19.14-stable
>           root=UUID=3d5cdc5d-1902-40bf-9e16-ca819372d350
>           ro
>           quiet
> Machine:  unknown
> Arch:     x86_64
> RAM:      63436188 kB
> Swap:     78125052 kB
> Distro:   debian 
> 
> ────────────────────────
>       TEST SUMMARY
> ────────────────────────
> Suite:   controllers
> Runtime: 36m 12s
> Runs:    347
> 
> Results:
>     Passed:   1742
>     Failed:   0
>     Broken:   0
>     Skipped:  97
>     Warnings: 0
> 
> Session stopped
> 
> With 6.19.14 I also get no hung tasks.
> 
> On 7.0.10 the tests also work:
> 
> root@homer:/mnt/data/linux-forest/kirk# LTPROOT=/home/bert/ltp-install/ ./kirk --run-suite controllers
> Host information
> 	Hostname:   homer
> 	Python:     3.13.12 (main, Feb  4 2026, 15:06:39) [GCC 15.2.0]
> 	Directory:  /tmp/kirk.root/tmpq32b09g7
> 
> Connecting to SUT: default
> 
> Suite: controllers
> ──────────────────
> cgroup_core01: pass  (0.016s)
> 
> [...]
> 
> pids_9_100: pass  (0.107s)
> 
> Execution time: 36m 15s
> 
> Disconnecting from SUT: default
> 
> Target information
> ──────────────────
> Kernel:   Linux 7.0.10-stable #1239 SMP PREEMPT_RT Sun May 31 00:42:41 CEST 2026
> Cmdline:  BOOT_IMAGE=/boot/vmlinuz-7.0.10-stable
>           root=UUID=3d5cdc5d-1902-40bf-9e16-ca819372d350
>           ro
>           quiet
> Machine:  unknown
> Arch:     x86_64
> RAM:      63435940 kB
> Swap:     78125052 kB
> Distro:   debian 
> 
> ────────────────────────
>       TEST SUMMARY
> ────────────────────────
> Suite:   controllers
> Runtime: 36m 13s
> Runs:    347
> 
> Results:
>     Passed:   1742
>     Failed:   0
>     Broken:   0
>     Skipped:  97
>     Warnings: 0
> 
> Session stopped
> 
> 
> 
> I'm not sure if this is related to the problems on arm64, but I'll try bisecting this.
> 
> Bert Karwatzki

I finished my bisection (from v7.0.0 to next-20260528) and it shows

commit 1dffd95575eb ("cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()")

as first bad commit, too. During the bisection I had to apply this patch (when it's cleanly applicable)

diff --git a/fs/filesystems.c b/fs/filesystems.c
index 771fc31a69b8..712316a1e3e0 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -269,7 +269,7 @@ static __cold noinline int regen_filesystems_string(void)
 	hlist_for_each_entry_rcu(p, &file_systems, list) {
 		if (!(p->fs_flags & FS_REQUIRES_DEV))
 			newlen += strlen("nodev");
-		newlen += strlen("\t") + strlen(p->name) +  strlen("\n");
+		newlen += strlen("\t") + strlen(p->name) + strlen("\n");
 	}
 	spin_unlock(&file_systems_lock);
 
@@ -289,6 +289,7 @@ static __cold noinline int regen_filesystems_string(void)
 	 * Did someone beat us to it?
 	 */
 	if (old && old->gen == file_systems_gen) {
+		spin_unlock(&file_systems_lock);
 		kfree(new);
 		return 0;
 	}
@@ -297,6 +298,7 @@ static __cold noinline int regen_filesystems_string(void)
 	 * Did the list change in the meantime?
 	 */
 	if (gen != file_systems_gen) {
+		spin_unlock(&file_systems_lock);
 		kfree(new);
 		goto retry;
 	}
@@ -321,13 +323,12 @@ static __cold noinline int regen_filesystems_string(void)
 		 * generation above and messes it up.
 		 */
 		spin_unlock(&file_systems_lock);
-		if (old)
-			kfree_rcu(old, rcu);
+		kfree(new);
 		return -EINVAL;
 	}
 
 	/*
-	 * Paired with consume fence in READ_ONCE() in filesystems_proc_show()
+	 * Paired with consume fence in rcu_dereference() in filesystems_proc_show()
 	 */
 	smp_store_release(&file_systems_string, new);
 	spin_unlock(&file_systems_lock);


to take care of a locking issue in commit
36b3306779ea ("fs: cache the string generated by reading /proc/filesystems")
https://lore.kernel.org/all/20260520225245.2962-1-spasswolf@web.de/

The test that hangs when running
# LTPROOT=/home/bert/ltp-install/ ./kirk --run-suite controllers
is always cgroup_fj_function_net_prio.
Also when bisecting this I disabled (i.e. commented out) the
memcg_stress test in ~/ltp-install/runtest/controllers as it takes a lot of
time (30min) and succeeds even in the version where hangs occur.

Bert Karwatzki


* Re: [PATCH 5/5] cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()
  2026-05-31 18:45           ` Bert Karwatzki
@ 2026-06-01  9:22             ` Bert Karwatzki
  2026-06-01 19:02               ` [PATCH] cgroup: Migrate tasks to the root css when a controller is rebound Tejun Heo
  0 siblings, 1 reply; 19+ messages in thread
From: Bert Karwatzki @ 2026-06-01  9:22 UTC (permalink / raw)
  To: Mark Brown, Tejun Heo
  Cc: Johannes Weiner, spasswolf, Michal Koutný,
	Sebastian Andrzej Siewior, Petr Malat, kernel test robot,
	Martin Pitt, cgroups, linux-kernel, Aishwarya.TCV

On Sunday, 31.05.2026 at 20:45 +0200, Bert Karwatzki wrote:
> 
> The test that hangs when running
> # LTPROOT=/home/bert/ltp-install/ ./kirk --run-suite controllers
> is always cgroup_fj_function_net_prio.
> Also when bisecting this I disabled (i.e. commented out) the
> memcg_stress test in ~/ltp-install/runtest/controllers as it takes a lot of
> time (30min) and succeeds even in the version where hangs occur.
> 
> Bert Karwatzki

I've done more testing and found that running the
cgroup_fj_function_net_prio test alone gives no hang; the hang
only occurs when other tests are run before it:

Suite: controllers
──────────────────
cgroup_core01: pass  (0.026s)
cgroup_core02: pass  (0.004s)
cgroup_core03: pass  (0.005s)
cgroup: fail  (2m 41s)
memcg_regression: skip  (3.558s)
memcg_test_3: pass  (0.112s)
memcg_failcnt: skip  (0.027s)
memcg_force_empty: skip  (0.016s)
memcg_limit_in_bytes: skip  (0.015s)
memcg_stat_rss: skip  (0.015s)
memcg_subgroup_charge: skip  (0.015s)
memcg_max_usage_in_bytes: skip  (0.014s)
memcg_move_charge_at_immigrate: skip  (0.015s)
memcg_memsw_limit_in_bytes: skip  (0.015s)
memcg_stat: skip  (0.014s)
memcg_use_hierarchy: skip  (0.015s)
memcg_usage_in_bytes: skip  (0.014s)
memcg_control: pass  (6.046s)
memcontrol01: pass  (0.004s)
memcontrol02: pass  (0.628s)
memcontrol03: pass  (16.009s)
memcontrol04: pass  (0.926s)
cgroup_fj_function_debug: skip  (0.012s)
cgroup_fj_function_cpuset: skip  (0.037s)
cgroup_fj_function_cpu: skip  (0.055s)
cgroup_fj_function_cpuacct: pass  (0.046s)
cgroup_fj_function_memory: skip  (0.035s)
cgroup_fj_function_freezer: pass  (0.044s)
cgroup_fj_function_devices: pass  (0.067s)
cgroup_fj_function_blkio: skip  (0.010s)
cgroup_fj_function_net_cls: pass  (0.055s)
cgroup_fj_function_perf_event: pass  (0.063s)
cgroup_fj_function_net_prio: HANG 

I tried to narrow down this list and found that a hang occurs
in the net_prio test only if the perf_event test is run before it:

cgroup_fj_function_perf_event: pass  (0.063s)
cgroup_fj_function_net_prio: HANG


Bert Karwatzki


* [PATCH] cgroup: Migrate tasks to the root css when a controller is rebound
  2026-06-01  9:22             ` Bert Karwatzki
@ 2026-06-01 19:02               ` Tejun Heo
  2026-06-01 19:07                 ` Bert Karwatzki
                                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Tejun Heo @ 2026-06-01 19:02 UTC (permalink / raw)
  To: cgroups, linux-kernel
  Cc: Mark Brown, Bert Karwatzki, Johannes Weiner, Michal Koutný,
	Sebastian Andrzej Siewior, Petr Malat, kernel test robot,
	Martin Pitt, Aishwarya.TCV, Tejun Heo

cgroup_apply_control_disable() defers kill_css_finish() while a css is
still populated, relying on css_update_populated() to fire the deferred
kill once the populated count reaches zero.

This deadlocks when a controller is rebound out of a hierarchy. Mounting
an implicit_on_dfl controller such as perf_event as a v1 hierarchy steals
it off the default hierarchy, and rebind_subsystems() kills its
per-cgroup csses while they are still populated. The migration run in the
same step keeps the old css for a controller no longer in the hierarchy's
mask, so no task is migrated off the dying csses. Their populated count
never reaches zero, the deferred kill_css_finish() never fires, and the
next cgroup_lock_and_drain_offline() hangs forever under cgroup_mutex.

That migration is already a no-op pass over the rebound subtree. Add
cgroup_rebind_ss_mask so find_existing_css_set() resolves the leaving
controllers to the root css. Their tasks are migrated there, the
per-cgroup csses depopulate, and cgroup_apply_control_disable() kills
them synchronously. The deferral stays correct for the rmdir and
controller-disable paths it was meant for.

Fixes: 1dffd95575eb ("cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()")
Reported-by: Mark Brown <broonie@kernel.org>
Closes: https://lore.kernel.org/all/41cd159c-54e5-45e0-81df-eaf36a6c028e@sirena.org.uk/
Reported-by: Bert Karwatzki <spasswolf@web.de>
Closes: https://lore.kernel.org/all/4e986b4ed7e16547805d54b6e67d09120bc4d2f2.camel@web.de/
Signed-off-by: Tejun Heo <tj@kernel.org>
---
Hello, and thanks a lot for all the reproduction information. It made this
much easier to track down.

Bert, Mark, would you mind giving this a try on your setups?
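
For anyone following along, the failure mode described above can be sketched
as a small userspace model. This is only an illustration under stated
assumptions, not kernel code: the toy_* names, the struct layout and the
single-task setup are all made up here. Only the three-way choice (leaving
controller resolves to the root css, a controller still in the hierarchy
resolves to the cgroup's css, anything else keeps the old css) and the
"kill now if unpopulated, otherwise defer" rule follow the description in
this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a css; NOT the kernel's struct cgroup_subsys_state. */
struct toy_css {
	int nr_populated;	/* tasks still attached to this css */
	bool kill_deferred;	/* kill is waiting for nr_populated == 0 */
	bool offlined;		/* kill completed */
};

/* Three-way choice modelled on the patched find_existing_css_set() loop. */
static struct toy_css *toy_resolve(uint32_t rebind_mask, uint32_t subsys_mask,
				   int ssid, struct toy_css *root_css,
				   struct toy_css *cgrp_css,
				   struct toy_css *old_css)
{
	if (rebind_mask & (1U << ssid))
		return root_css;	/* controller is leaving: go to root */
	if (subsys_mask & (1U << ssid))
		return cgrp_css;	/* controller stays: use the cgroup's css */
	return old_css;			/* not in this hierarchy: keep old css */
}

/* Move one task between csses; a no-op when source and target match. */
static void toy_migrate(struct toy_css *from, struct toy_css *to)
{
	if (from == to)
		return;
	assert(from->nr_populated > 0);
	from->nr_populated--;
	to->nr_populated++;
}

/* Kill now if empty, otherwise defer. Returns true when killed right away. */
static bool toy_kill(struct toy_css *css)
{
	if (css->nr_populated > 0) {
		css->kill_deferred = true;
		return false;
	}
	css->offlined = true;
	return true;
}

/* One task in a child cgroup; controller 0 is rebound out of the hierarchy. */
static bool toy_rebind(bool use_rebind_mask)
{
	struct toy_css root_css = { 0 };
	struct toy_css child_css = { .nr_populated = 1 };
	uint32_t subsys_mask = 1U << 0;
	uint32_t rebind_mask = use_rebind_mask ? (1U << 0) : 0;
	struct toy_css *dst;

	subsys_mask &= ~(1U << 0);	/* controller leaves the hierarchy */
	dst = toy_resolve(rebind_mask, subsys_mask, 0,
			  &root_css, &child_css, &child_css);
	toy_migrate(&child_css, dst);
	return toy_kill(&child_css);
}
```

In this model, toy_rebind(false) stands for the behaviour before the fix: the
lookup keeps the old css, no task moves, and the kill is deferred with nothing
left to complete it. toy_rebind(true) stands for the behaviour with the mask
set: the task moves to the root css and the per-cgroup css is killed right
away. Calling both from a small main() is enough to see the difference.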

 kernel/cgroup/cgroup.c | 35 +++++++++++++++++++++++++++++++----
 1 file changed, 31 insertions(+), 4 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index bdc8deedb4f7..7f4861109e48 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -197,6 +197,14 @@ static u32 cgrp_dfl_implicit_ss_mask;
 /* some controllers can be threaded on the default hierarchy */
 static u32 cgrp_dfl_threaded_ss_mask;
 
+/*
+ * Set across rebind_subsystems() to the controllers leaving a hierarchy.
+ * Guarded by cgroup_mutex. Makes find_existing_css_set() resolve them to the
+ * root css so the affected tasks are migrated there before
+ * cgroup_apply_control_disable() kills the per-cgroup csses.
+ */
+static u32 cgroup_rebind_ss_mask;
+
 /* The list of hierarchy roots */
 LIST_HEAD(cgroup_roots);
 static int cgroup_root_count;
@@ -1083,7 +1091,15 @@ static struct css_set *find_existing_css_set(struct css_set *old_cset,
 	 * won't change, so no need for locking.
 	 */
 	for_each_subsys(ss, i) {
-		if (root->subsys_mask & (1UL << i)) {
+		if (unlikely(cgroup_rebind_ss_mask & (1UL << i))) {
+			/*
+			 * @ss is leaving this hierarchy and its per-cgroup
+			 * csses are about to be killed. Resolve to the
+			 * surviving root css so the tasks are migrated there.
+			 */
+			template[i] = cgroup_css(&root->cgrp, ss);
+			WARN_ON_ONCE(!template[i]);
+		} else if (root->subsys_mask & (1UL << i)) {
 			/*
 			 * @ss is in this hierarchy, so we want the
 			 * effective css from @cgrp.
@@ -1853,11 +1869,17 @@ int rebind_subsystems(struct cgroup_root *dst_root, u32 ss_mask)
 		struct cgroup *scgrp = &cgrp_dfl_root.cgrp;
 
 		/*
-		 * Controllers from default hierarchy that need to be rebound
-		 * are all disabled together in one go.
+		 * Controllers leaving the default hierarchy are disabled
+		 * together. cgroup_rebind_ss_mask makes cgroup_apply_control()
+		 * migrate their tasks to the root css, so the per-cgroup csses
+		 * are unpopulated when cgroup_finalize_control() kills them.
+		 * Clear it before cgroup_finalize_control(), which does no
+		 * css_set lookup.
 		 */
 		cgrp_dfl_root.subsys_mask &= ~dfl_disable_ss_mask;
+		cgroup_rebind_ss_mask = dfl_disable_ss_mask;
 		WARN_ON(cgroup_apply_control(scgrp));
+		cgroup_rebind_ss_mask = 0;
 		cgroup_finalize_control(scgrp, 0);
 	}
 
@@ -1871,9 +1893,14 @@ int rebind_subsystems(struct cgroup_root *dst_root, u32 ss_mask)
 		WARN_ON(!css || cgroup_css(dcgrp, ss));
 
 		if (src_root != &cgrp_dfl_root) {
-			/* disable from the source */
+			/*
+			 * Disable from the source, migrating its tasks to the
+			 * root css first (see cgroup_rebind_ss_mask).
+			 */
 			src_root->subsys_mask &= ~(1 << ssid);
+			cgroup_rebind_ss_mask = 1 << ssid;
 			WARN_ON(cgroup_apply_control(scgrp));
+			cgroup_rebind_ss_mask = 0;
 			cgroup_finalize_control(scgrp, 0);
 		}
 
-- 
2.54.0



* Re: [PATCH] cgroup: Migrate tasks to the root css when a controller is rebound
  2026-06-01 19:02               ` [PATCH] cgroup: Migrate tasks to the root css when a controller is rebound Tejun Heo
@ 2026-06-01 19:07                 ` Bert Karwatzki
  2026-06-01 19:50                   ` Bert Karwatzki
  2026-06-02 16:28                 ` Mark Brown
  2026-06-02 18:34                 ` Tejun Heo
  2 siblings, 1 reply; 19+ messages in thread
From: Bert Karwatzki @ 2026-06-01 19:07 UTC (permalink / raw)
  To: Tejun Heo, cgroups, linux-kernel
  Cc: Mark Brown, spasswolf, Johannes Weiner, Michal Koutný,
	Sebastian Andrzej Siewior, Petr Malat, kernel test robot,
	Martin Pitt, Aishwarya.TCV

On Monday, 01.06.2026 at 09:02 -1000, Tejun Heo wrote:
> cgroup_apply_control_disable() defers kill_css_finish() while a css is
> still populated, relying on css_update_populated() to fire the deferred
> kill once the populated count reaches zero.
> 
> This deadlocks when a controller is rebound out of a hierarchy. Mounting
> an implicit_on_dfl controller such as perf_event as a v1 hierarchy steals
> it off the default hierarchy, and rebind_subsystems() kills its
> per-cgroup csses while they are still populated. The migration run in the
> same step keeps the old css for a controller no longer in the hierarchy's
> mask, so no task is migrated off the dying csses. Their populated count
> never reaches zero, the deferred kill_css_finish() never fires, and the
> next cgroup_lock_and_drain_offline() hangs forever under cgroup_mutex.
> 
> That migration is already a no-op pass over the rebound subtree. Add
> cgroup_rebind_ss_mask so find_existing_css_set() resolves the leaving
> controllers to the root css. Their tasks are migrated there, the
> per-cgroup csses depopulate, and cgroup_apply_control_disable() kills
> them synchronously. The deferral stays correct for the rmdir and
> controller-disable paths it was meant for.
> 
> Fixes: 1dffd95575eb ("cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()")
> Reported-by: Mark Brown <broonie@kernel.org>
> Closes: https://lore.kernel.org/all/41cd159c-54e5-45e0-81df-eaf36a6c028e@sirena.org.uk/
> Reported-by: Bert Karwatzki <spasswolf@web.de>
> Closes: https://lore.kernel.org/all/4e986b4ed7e16547805d54b6e67d09120bc4d2f2.camel@web.de/
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> Hello, and thanks a lot for all the reproduction information. It made this
> much easier to track down.
> 
> Bert, Mark, would you mind giving this a try on your setups?
> 
>  kernel/cgroup/cgroup.c | 35 +++++++++++++++++++++++++++++++----
>  1 file changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index bdc8deedb4f7..7f4861109e48 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -197,6 +197,14 @@ static u32 cgrp_dfl_implicit_ss_mask;
>  /* some controllers can be threaded on the default hierarchy */
>  static u32 cgrp_dfl_threaded_ss_mask;
>  
> +/*
> + * Set across rebind_subsystems() to the controllers leaving a hierarchy.
> + * Guarded by cgroup_mutex. Makes find_existing_css_set() resolve them to the
> + * root css so the affected tasks are migrated there before
> + * cgroup_apply_control_disable() kills the per-cgroup csses.
> + */
> +static u32 cgroup_rebind_ss_mask;
> +
>  /* The list of hierarchy roots */
>  LIST_HEAD(cgroup_roots);
>  static int cgroup_root_count;
> @@ -1083,7 +1091,15 @@ static struct css_set *find_existing_css_set(struct css_set *old_cset,
>  	 * won't change, so no need for locking.
>  	 */
>  	for_each_subsys(ss, i) {
> -		if (root->subsys_mask & (1UL << i)) {
> +		if (unlikely(cgroup_rebind_ss_mask & (1UL << i))) {
> +			/*
> +			 * @ss is leaving this hierarchy and its per-cgroup
> +			 * csses are about to be killed. Resolve to the
> +			 * surviving root css so the tasks are migrated there.
> +			 */
> +			template[i] = cgroup_css(&root->cgrp, ss);
> +			WARN_ON_ONCE(!template[i]);
> +		} else if (root->subsys_mask & (1UL << i)) {
>  			/*
>  			 * @ss is in this hierarchy, so we want the
>  			 * effective css from @cgrp.
> @@ -1853,11 +1869,17 @@ int rebind_subsystems(struct cgroup_root *dst_root, u32 ss_mask)
>  		struct cgroup *scgrp = &cgrp_dfl_root.cgrp;
>  
>  		/*
> -		 * Controllers from default hierarchy that need to be rebound
> -		 * are all disabled together in one go.
> +		 * Controllers leaving the default hierarchy are disabled
> +		 * together. cgroup_rebind_ss_mask makes cgroup_apply_control()
> +		 * migrate their tasks to the root css, so the per-cgroup csses
> +		 * are unpopulated when cgroup_finalize_control() kills them.
> +		 * Clear it before cgroup_finalize_control(), which does no
> +		 * css_set lookup.
>  		 */
>  		cgrp_dfl_root.subsys_mask &= ~dfl_disable_ss_mask;
> +		cgroup_rebind_ss_mask = dfl_disable_ss_mask;
>  		WARN_ON(cgroup_apply_control(scgrp));
> +		cgroup_rebind_ss_mask = 0;
>  		cgroup_finalize_control(scgrp, 0);
>  	}
>  
> @@ -1871,9 +1893,14 @@ int rebind_subsystems(struct cgroup_root *dst_root, u32 ss_mask)
>  		WARN_ON(!css || cgroup_css(dcgrp, ss));
>  
>  		if (src_root != &cgrp_dfl_root) {
> -			/* disable from the source */
> +			/*
> +			 * Disable from the source, migrating its tasks to the
> +			 * root css first (see cgroup_rebind_ss_mask).
> +			 */
>  			src_root->subsys_mask &= ~(1 << ssid);
> +			cgroup_rebind_ss_mask = 1 << ssid;
>  			WARN_ON(cgroup_apply_control(scgrp));
> +			cgroup_rebind_ss_mask = 0;
>  			cgroup_finalize_control(scgrp, 0);
>  		}
>  

I'll try this right away, but I found out another thing. My real problem seems
to be the perf_event test: whichever test I run after perf_event hangs, no
matter which one it is:

cgroup_fj_function_perf_event: pass  (0.206s)
cgroup_core01: HANG 

Bert Karwatzki


* Re: [PATCH] cgroup: Migrate tasks to the root css when a controller is rebound
  2026-06-01 19:07                 ` Bert Karwatzki
@ 2026-06-01 19:50                   ` Bert Karwatzki
  0 siblings, 0 replies; 19+ messages in thread
From: Bert Karwatzki @ 2026-06-01 19:50 UTC (permalink / raw)
  To: Tejun Heo, cgroups, linux-kernel
  Cc: Mark Brown, spasswolf, Johannes Weiner, Michal Koutný,
	Sebastian Andrzej Siewior, Petr Malat, kernel test robot,
	Martin Pitt, Aishwarya.TCV

On Monday, 01.06.2026 at 21:07 +0200, Bert Karwatzki wrote:
> On Monday, 01.06.2026 at 09:02 -1000, Tejun Heo wrote:
> > [...]
> 
> Bert Karwatzki

Your fix works for me. No more hangs after cgroup_fj_function_perf_event is run.
Let's hope this solves Mark's problems, too.

Tested-by: Bert Karwatzki <spasswolf@web.de>

Bert Karwatzki


* Re: [PATCH] cgroup: Migrate tasks to the root css when a controller is rebound
  2026-06-01 19:02               ` [PATCH] cgroup: Migrate tasks to the root css when a controller is rebound Tejun Heo
  2026-06-01 19:07                 ` Bert Karwatzki
@ 2026-06-02 16:28                 ` Mark Brown
  2026-06-02 18:34                 ` Tejun Heo
  2 siblings, 0 replies; 19+ messages in thread
From: Mark Brown @ 2026-06-02 16:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, linux-kernel, Bert Karwatzki, Johannes Weiner,
	Michal Koutný, Sebastian Andrzej Siewior, Petr Malat,
	kernel test robot, Martin Pitt, Aishwarya.TCV

On Mon, Jun 01, 2026 at 09:02:56AM -1000, Tejun Heo wrote:
> cgroup_apply_control_disable() defers kill_css_finish() while a css is
> still populated, relying on css_update_populated() to fire the deferred
> kill once the populated count reaches zero.

This seems to fix things for me, thanks both!

Tested-by: Mark Brown <broonie@kernel.org>


* Re: [PATCH] cgroup: Migrate tasks to the root css when a controller is rebound
  2026-06-01 19:02               ` [PATCH] cgroup: Migrate tasks to the root css when a controller is rebound Tejun Heo
  2026-06-01 19:07                 ` Bert Karwatzki
  2026-06-02 16:28                 ` Mark Brown
@ 2026-06-02 18:34                 ` Tejun Heo
  2 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2026-06-02 18:34 UTC (permalink / raw)
  To: cgroups, linux-kernel
  Cc: Mark Brown, Bert Karwatzki, Johannes Weiner, Michal Koutný,
	Sebastian Andrzej Siewior, Petr Malat, kernel test robot,
	Martin Pitt, Aishwarya.TCV

Applied to cgroup/for-7.2.

Thanks Mark and Bert for the reports and the testing.

--
tejun


* Re: [PATCHSET cgroup/for-7.2] cgroup: Per-css kill_css_finish deferral
  2026-05-05  0:51 [PATCHSET cgroup/for-7.2] cgroup: Per-css kill_css_finish deferral Tejun Heo
                   ` (4 preceding siblings ...)
  2026-05-05  0:51 ` [PATCH 5/5] cgroup: Defer kill_css_finish() in cgroup_apply_control_disable() Tejun Heo
@ 2026-05-13 21:01 ` Tejun Heo
  2026-05-15 17:28 ` Tejun Heo
  6 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2026-05-13 21:01 UTC (permalink / raw)
  To: Johannes Weiner, Michal Koutný
  Cc: Sebastian Andrzej Siewior, Petr Malat, Bert Karwatzki,
	kernel test robot, Martin Pitt, cgroups, linux-kernel

Michal, any thoughts?

Thanks.

-- 
tejun


* Re: [PATCHSET cgroup/for-7.2] cgroup: Per-css kill_css_finish deferral
  2026-05-05  0:51 [PATCHSET cgroup/for-7.2] cgroup: Per-css kill_css_finish deferral Tejun Heo
                   ` (5 preceding siblings ...)
  2026-05-13 21:01 ` [PATCHSET cgroup/for-7.2] cgroup: Per-css kill_css_finish deferral Tejun Heo
@ 2026-05-15 17:28 ` Tejun Heo
  6 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2026-05-15 17:28 UTC (permalink / raw)
  To: Johannes Weiner, Michal Koutný
  Cc: Sebastian Andrzej Siewior, Petr Malat, Bert Karwatzki,
	kernel test robot, Martin Pitt, cgroups, linux-kernel

Hello,

> Tejun Heo (5):
>   cgroup: Inline cgroup_has_tasks() in cgroup.h
>   cgroup: Annotate unlocked nr_populated_* accesses with READ_ONCE/WRITE_ONCE
>   cgroup: Move populated counters to cgroup_subsys_state
>   cgroup: Add per-subsys-css kill_css_finish deferral
>   cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()

Applied 1-5 to cgroup/for-7.2.

Thanks.

--
tejun

