public inbox for linux-kernel@vger.kernel.org
* [PATCH v6 0/8] cgroup/cpuset: Fix partition related locking issues
@ 2026-02-21 18:54 Waiman Long
  2026-02-21 18:54 ` [PATCH v6 1/8] cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del() Waiman Long
                   ` (8 more replies)
  0 siblings, 9 replies; 29+ messages in thread
From: Waiman Long @ 2026-02-21 18:54 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Frederic Weisbecker, Thomas Gleixner, Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long

 v6:
  - Rebase on top of the latest v7.0 pre-RC linux tree.
  - Add another fix patch for a problem found during code inspection.
  - Revert to the v4 idea of simply deferring the housekeeping_update()
    call to a workqueue, as the v5 change would add quite a bit more
    complexity to the cpuset code.

 v5:
  - https://lore.kernel.org/lkml/20260212164640.2408295-1-longman@redhat.com/

After booting the latest linux debug kernel with the latest cgroup
changes as well as Frederic's "cpuset/isolation: Honour kthreads
preferred affinity" patch series [1] merged on top and running the
test_cpuset_prs.sh test, a circular locking dependency lockdep splat
was reported. See patch 5 for details.

To fix this issue, the cpuset code is modified to avoid calling
housekeeping_update() with cpu_hotplug_lock held.  The cpuset hotplug
code is also modified to defer the housekeeping_update() call, if needed,
to a workqueue.  A new top level cpuset_top_mutex is added to provide
additional exclusion control.

With these changes in place, the cpuset test ran to completion with no
failure and no lockdep splat.

[1] https://lore.kernel.org/lkml/20260125224541.50226-1-frederic@kernel.org/

Waiman Long (8):
  cgroup/cpuset: Fix incorrect change to effective_xcpus in
    partition_xcpus_del()
  cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in
    update_cpumasks_hier()
  cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
  cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is
    changed
  kselftest/cgroup: Simplify test_cpuset_prs.sh by removing "S+" command
  cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains()
    together
  cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to
    workqueue
  cgroup/cpuset: Call housekeeping_update() without holding
    cpus_read_lock

 kernel/cgroup/cpuset.c                        | 220 +++++++++++------
 kernel/sched/isolation.c                      |   4 +-
 kernel/time/timer_migration.c                 |   4 +-
 .../selftests/cgroup/test_cpuset_prs.sh       | 225 +++++++++---------
 4 files changed, 265 insertions(+), 188 deletions(-)

-- 
2.53.0


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v6 1/8] cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del()
  2026-02-21 18:54 [PATCH v6 0/8] cgroup/cpuset: Fix partition related locking issues Waiman Long
@ 2026-02-21 18:54 ` Waiman Long
  2026-02-21 18:54 ` [PATCH v6 2/8] cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in update_cpumasks_hier() Waiman Long
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Waiman Long @ 2026-02-21 18:54 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Frederic Weisbecker, Thomas Gleixner, Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long

The effective_xcpus of a cpuset can contain offline CPUs. In
partition_xcpus_del(), the xcpus parameter is incorrectly used as
a temporary cpumask to mask out offline CPUs. As xcpus can be the
effective_xcpus of a cpuset, this can result in unexpected changes
in that cpumask. Fix this problem by not making any changes to the
xcpus parameter.

Fixes: 11e5f407b64a ("cgroup/cpuset: Keep track of CPUs in isolated partitions")
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 7607dfe516e6..4d10e320b144 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1220,8 +1220,8 @@ static void partition_xcpus_del(int old_prs, struct cpuset *parent,
 		isolated_cpus_update(old_prs, parent->partition_root_state,
 				     xcpus);
 
-	cpumask_and(xcpus, xcpus, cpu_active_mask);
 	cpumask_or(parent->effective_cpus, parent->effective_cpus, xcpus);
+	cpumask_and(parent->effective_cpus, parent->effective_cpus, cpu_active_mask);
 }
 
 /*
-- 
2.53.0



* [PATCH v6 2/8] cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in update_cpumasks_hier()
  2026-02-21 18:54 [PATCH v6 0/8] cgroup/cpuset: Fix partition related locking issues Waiman Long
  2026-02-21 18:54 ` [PATCH v6 1/8] cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del() Waiman Long
@ 2026-02-21 18:54 ` Waiman Long
  2026-02-21 18:54 ` [PATCH v6 3/8] cgroup/cpuset: Clarify exclusion rules for cpuset internal variables Waiman Long
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Waiman Long @ 2026-02-21 18:54 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Frederic Weisbecker, Thomas Gleixner, Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long

Commit e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2")
incorrectly changed the 2nd parameter of cpuset_update_tasks_cpumask()
from tmp->new_cpus to cp->effective_cpus. This second parameter is just
a temporary cpumask for internal use. The cpuset_update_tasks_cpumask()
function was originally called update_tasks_cpumask() before commit
381b53c3b549 ("cgroup/cpuset: rename functions shared between v1
and v2").

This mistake can incorrectly change the effective_cpus of a cpuset
when it is the top_cpuset, or on the arm64 architecture where
task_cpu_possible_mask() may differ from cpu_possible_mask. So far
top_cpuset is never passed to update_cpumasks_hier(), but arm64 can
still be affected. Fix it by reverting the incorrect change.

Fixes: e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2")
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 4d10e320b144..58660e06d322 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2156,7 +2156,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
 		WARN_ON(!is_in_v2_mode() &&
 			!cpumask_equal(cp->cpus_allowed, cp->effective_cpus));
 
-		cpuset_update_tasks_cpumask(cp, cp->effective_cpus);
+		cpuset_update_tasks_cpumask(cp, tmp->new_cpus);
 
 		/*
 		 * On default hierarchy, inherit the CS_SCHED_LOAD_BALANCE
-- 
2.53.0



* [PATCH v6 3/8] cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
  2026-02-21 18:54 [PATCH v6 0/8] cgroup/cpuset: Fix partition related locking issues Waiman Long
  2026-02-21 18:54 ` [PATCH v6 1/8] cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del() Waiman Long
  2026-02-21 18:54 ` [PATCH v6 2/8] cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in update_cpumasks_hier() Waiman Long
@ 2026-02-21 18:54 ` Waiman Long
  2026-02-26 15:00   ` Frederic Weisbecker
  2026-02-21 18:54 ` [PATCH v6 4/8] cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed Waiman Long
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2026-02-21 18:54 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Frederic Weisbecker, Thomas Gleixner, Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long

Clarify the locking rules associated with file level internal variables
inside the cpuset code. There is no functional change.

Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 105 ++++++++++++++++++++++++-----------------
 1 file changed, 61 insertions(+), 44 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 58660e06d322..e8c0b3cfd1f9 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -61,6 +61,58 @@ static const char * const perr_strings[] = {
 	[PERR_REMOTE]    = "Have remote partition underneath",
 };
 
+/*
+ * CPUSET Locking Convention
+ * -------------------------
+ *
+ * Below are the three global locks guarding cpuset structures in lock
+ * acquisition order:
+ *  - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
+ *  - cpuset_mutex
+ *  - callback_lock (raw spinlock)
+ *
+ * A task must hold all three locks to modify externally visible or
+ * used fields of cpusets, though some of the internally used cpuset fields
+ * and internal variables can be modified without holding callback_lock. If
+ * only reliable read access of the externally used fields is needed, a task
+ * can hold either cpuset_mutex or callback_lock, both of which are exposed
+ * to other external subsystems.
+ *
+ * If a task holds cpu_hotplug_lock and cpuset_mutex, it blocks others,
+ * ensuring that it is the only task able to also acquire callback_lock and
+ * be able to modify cpusets.  It can perform various checks on the cpuset
+ * structure first, knowing nothing will change. It can also allocate memory
+ * without holding callback_lock. While it is performing these checks, various
+ * callback routines can briefly acquire callback_lock to query cpusets.  Once
+ * it is ready to make the changes, it takes callback_lock, blocking everyone
+ * else.
+ *
+ * Calls to the kernel memory allocator cannot be made while holding
+ * callback_lock which is a spinlock, as the memory allocator may sleep or
+ * call back into cpuset code and acquire callback_lock.
+ *
+ * Now, the task_struct fields mems_allowed and mempolicy may be changed
+ * by another task, so we use alloc_lock in the task_struct to protect
+ * them.
+ *
+ * The cpuset_common_seq_show() handlers only hold callback_lock across
+ * small pieces of code, such as when reading out possibly multi-word
+ * cpumasks and nodemasks.
+ */
+
+static DEFINE_MUTEX(cpuset_mutex);
+
+/*
+ * File level internal variables below follow one of the following exclusion
+ * rules.
+ *
+ * RWCS: Read/write-able by holding either cpus_write_lock (and optionally
+ *	 cpuset_mutex) or both cpus_read_lock and cpuset_mutex.
+ *
+ * CSCB: Readable by holding either cpuset_mutex or callback_lock. Writable
+ *	 by holding both cpuset_mutex and callback_lock.
+ */
+
 /*
  * For local partitions, update to subpartitions_cpus & isolated_cpus is done
  * in update_parent_effective_cpumask(). For remote partitions, it is done in
@@ -70,19 +122,18 @@ static const char * const perr_strings[] = {
  * Exclusive CPUs distributed out to local or remote sub-partitions of
  * top_cpuset
  */
-static cpumask_var_t	subpartitions_cpus;
+static cpumask_var_t	subpartitions_cpus;	/* RWCS */
 
 /*
- * Exclusive CPUs in isolated partitions
+ * Exclusive CPUs in isolated partitions (shown in cpuset.cpus.isolated)
  */
-static cpumask_var_t	isolated_cpus;
+static cpumask_var_t	isolated_cpus;		/* CSCB */
 
 /*
- * isolated_cpus updating flag (protected by cpuset_mutex)
- * Set if isolated_cpus is going to be updated in the current
- * cpuset_mutex crtical section.
+ * Set if isolated_cpus is being updated in the current cpuset_mutex
+ * critical section.
  */
-static bool isolated_cpus_updating;
+static bool		isolated_cpus_updating;	/* RWCS */
 
 /*
  * A flag to force sched domain rebuild at the end of an operation.
@@ -98,7 +149,7 @@ static bool isolated_cpus_updating;
  * Note that update_relax_domain_level() in cpuset-v1.c can still call
  * rebuild_sched_domains_locked() directly without using this flag.
  */
-static bool force_sd_rebuild;
+static bool force_sd_rebuild;			/* RWCS */
 
 /*
  * Partition root states:
@@ -218,42 +269,6 @@ struct cpuset top_cpuset = {
 	.partition_root_state = PRS_ROOT,
 };
 
-/*
- * There are two global locks guarding cpuset structures - cpuset_mutex and
- * callback_lock. The cpuset code uses only cpuset_mutex. Other kernel
- * subsystems can use cpuset_lock()/cpuset_unlock() to prevent change to cpuset
- * structures. Note that cpuset_mutex needs to be a mutex as it is used in
- * paths that rely on priority inheritance (e.g. scheduler - on RT) for
- * correctness.
- *
- * A task must hold both locks to modify cpusets.  If a task holds
- * cpuset_mutex, it blocks others, ensuring that it is the only task able to
- * also acquire callback_lock and be able to modify cpusets.  It can perform
- * various checks on the cpuset structure first, knowing nothing will change.
- * It can also allocate memory while just holding cpuset_mutex.  While it is
- * performing these checks, various callback routines can briefly acquire
- * callback_lock to query cpusets.  Once it is ready to make the changes, it
- * takes callback_lock, blocking everyone else.
- *
- * Calls to the kernel memory allocator can not be made while holding
- * callback_lock, as that would risk double tripping on callback_lock
- * from one of the callbacks into the cpuset code from within
- * __alloc_pages().
- *
- * If a task is only holding callback_lock, then it has read-only
- * access to cpusets.
- *
- * Now, the task_struct fields mems_allowed and mempolicy may be changed
- * by other task, we use alloc_lock in the task_struct fields to protect
- * them.
- *
- * The cpuset_common_seq_show() handlers only hold callback_lock across
- * small pieces of code, such as when reading out possibly multi-word
- * cpumasks and nodemasks.
- */
-
-static DEFINE_MUTEX(cpuset_mutex);
-
 /**
  * cpuset_lock - Acquire the global cpuset mutex
  *
@@ -1162,6 +1177,8 @@ static void reset_partition_data(struct cpuset *cs)
 static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus)
 {
 	WARN_ON_ONCE(old_prs == new_prs);
+	lockdep_assert_held(&callback_lock);
+	lockdep_assert_held(&cpuset_mutex);
 	if (new_prs == PRS_ISOLATED)
 		cpumask_or(isolated_cpus, isolated_cpus, xcpus);
 	else
-- 
2.53.0



* [PATCH v6 4/8] cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed
  2026-02-21 18:54 [PATCH v6 0/8] cgroup/cpuset: Fix partition related locking issues Waiman Long
                   ` (2 preceding siblings ...)
  2026-02-21 18:54 ` [PATCH v6 3/8] cgroup/cpuset: Clarify exclusion rules for cpuset internal variables Waiman Long
@ 2026-02-21 18:54 ` Waiman Long
  2026-02-26 15:07   ` Frederic Weisbecker
  2026-02-21 18:54 ` [PATCH v6 5/8] kselftest/cgroup: Simplify test_cpuset_prs.sh by removing "S+" command Waiman Long
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2026-02-21 18:54 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Frederic Weisbecker, Thomas Gleixner, Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long

Since cpuset now updates the HK_TYPE_DOMAIN housekeeping mask whenever
the set of isolated CPUs changes, such a change is more costly than
before.  Right now, the isolated_cpus_updating flag can be set even if
there is no real change to isolated_cpus. Add additional checks to make
sure that isolated_cpus_updating is set only if there is a real change
to isolated_cpus.

Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e8c0b3cfd1f9..05adf6697030 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1179,11 +1179,15 @@ static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus
 	WARN_ON_ONCE(old_prs == new_prs);
 	lockdep_assert_held(&callback_lock);
 	lockdep_assert_held(&cpuset_mutex);
-	if (new_prs == PRS_ISOLATED)
+	if (new_prs == PRS_ISOLATED) {
+		if (cpumask_subset(xcpus, isolated_cpus))
+			return;
 		cpumask_or(isolated_cpus, isolated_cpus, xcpus);
-	else
+	} else {
+		if (!cpumask_intersects(xcpus, isolated_cpus))
+			return;
 		cpumask_andnot(isolated_cpus, isolated_cpus, xcpus);
-
+	}
 	isolated_cpus_updating = true;
 }
 
-- 
2.53.0



* [PATCH v6 5/8] kselftest/cgroup: Simplify test_cpuset_prs.sh by removing "S+" command
  2026-02-21 18:54 [PATCH v6 0/8] cgroup/cpuset: Fix partition related locking issues Waiman Long
                   ` (3 preceding siblings ...)
  2026-02-21 18:54 ` [PATCH v6 4/8] cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed Waiman Long
@ 2026-02-21 18:54 ` Waiman Long
  2026-02-21 18:54 ` [PATCH v6 6/8] cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together Waiman Long
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Waiman Long @ 2026-02-21 18:54 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Frederic Weisbecker, Thomas Gleixner, Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long

The "S+" command is used in the test matrix to enable the cpuset
controller. However, this can be done automatically, and the "S-"
command to disable the cpuset controller is never used. Simplify the
test matrix and reduce clutter by removing the "S+" command and
enabling the controller automatically. There is no functional change
to the test cases.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 .../selftests/cgroup/test_cpuset_prs.sh       | 214 +++++++++---------
 1 file changed, 105 insertions(+), 109 deletions(-)

diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index 5dff3ad53867..0c5db118f2d1 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -196,7 +196,6 @@ test_add_proc()
 #  P<v> = set cpus.partition (0:member, 1:root, 2:isolated)
 #  C<l> = add cpu-list to cpuset.cpus
 #  X<l> = add cpu-list to cpuset.cpus.exclusive
-#  S<p> = use prefix in subtree_control
 #  T    = put a task into cgroup
 #  CX<l> = add cpu-list to both cpuset.cpus and cpuset.cpus.exclusive
 #  O<c>=<v> = Write <v> to CPU online file of <c>
@@ -209,44 +208,44 @@ test_add_proc()
 # sched-debug matching which includes offline CPUs and single-CPU partitions
 # while the second one is for matching cpuset.cpus.isolated.
 #
-SETUP_A123_PARTITIONS="C1-3:P1:S+ C2-3:P1:S+ C3:P1"
+SETUP_A123_PARTITIONS="C1-3:P1 C2-3:P1 C3:P1"
 TEST_MATRIX=(
 	#  old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS
 	#  ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ --------
-	"   C0-1     .      .    C2-3    S+    C4-5     .      .     0 A2:0-1"
+	"   C0-1     .      .    C2-3     .    C4-5     .      .     0 A2:0-1"
 	"   C0-1     .      .    C2-3    P1      .      .      .     0 "
-	"   C0-1     .      .    C2-3   P1:S+ C0-1:P1   .      .     0 "
-	"   C0-1     .      .    C2-3   P1:S+  C1:P1    .      .     0 "
-	"  C0-1:S+   .      .    C2-3     .      .      .     P1     0 "
-	"  C0-1:P1   .      .    C2-3    S+     C1      .      .     0 "
-	"  C0-1:P1   .      .    C2-3    S+    C1:P1    .      .     0 "
-	"  C0-1:P1   .      .    C2-3    S+    C1:P1    .     P1     0 "
+	"   C0-1     .      .    C2-3    P1   C0-1:P1   .      .     0 "
+	"   C0-1     .      .    C2-3    P1    C1:P1    .      .     0 "
+	"   C0-1     .      .    C2-3     .      .      .     P1     0 "
+	"  C0-1:P1   .      .    C2-3     .     C1      .      .     0 "
+	"  C0-1:P1   .      .    C2-3     .    C1:P1    .      .     0 "
+	"  C0-1:P1   .      .    C2-3     .    C1:P1    .     P1     0 "
+	"  C0-1:P1   .      .    C2-3   C4-5     .      .      .     0 A1:4-5"
 	"  C0-1:P1   .      .    C2-3   C4-5     .      .      .     0 A1:4-5"
-	"  C0-1:P1   .      .    C2-3  S+:C4-5   .      .      .     0 A1:4-5"
 	"   C0-1     .      .   C2-3:P1   .      .      .     C2     0 "
 	"   C0-1     .      .   C2-3:P1   .      .      .    C4-5    0 B1:4-5"
-	"C0-3:P1:S+ C2-3:P1 .      .      .      .      .      .     0 A1:0-1|A2:2-3|XA2:2-3"
-	"C0-3:P1:S+ C2-3:P1 .      .     C1-3    .      .      .     0 A1:1|A2:2-3|XA2:2-3"
-	"C2-3:P1:S+  C3:P1  .      .     C3      .      .      .     0 A1:|A2:3|XA2:3 A1:P1|A2:P1"
-	"C2-3:P1:S+  C3:P1  .      .     C3      P0     .      .     0 A1:3|A2:3 A1:P1|A2:P0"
-	"C2-3:P1:S+  C2:P1  .      .     C2-4    .      .      .     0 A1:3-4|A2:2"
-	"C2-3:P1:S+  C3:P1  .      .     C3      .      .     C0-2   0 A1:|B1:0-2 A1:P1|A2:P1"
+	"  C0-3:P1 C2-3:P1  .      .      .      .      .      .     0 A1:0-1|A2:2-3|XA2:2-3"
+	"  C0-3:P1 C2-3:P1  .      .     C1-3    .      .      .     0 A1:1|A2:2-3|XA2:2-3"
+	"  C2-3:P1  C3:P1   .      .     C3      .      .      .     0 A1:|A2:3|XA2:3 A1:P1|A2:P1"
+	"  C2-3:P1  C3:P1   .      .     C3      P0     .      .     0 A1:3|A2:3 A1:P1|A2:P0"
+	"  C2-3:P1  C2:P1   .      .     C2-4    .      .      .     0 A1:3-4|A2:2"
+	"  C2-3:P1  C3:P1   .      .     C3      .      .     C0-2   0 A1:|B1:0-2 A1:P1|A2:P1"
 	"$SETUP_A123_PARTITIONS    .     C2-3    .      .      .     0 A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
 
 	# CPU offlining cases:
-	"   C0-1     .      .    C2-3    S+    C4-5     .     O2=0   0 A1:0-1|B1:3"
-	"C0-3:P1:S+ C2-3:P1 .      .     O2=0    .      .      .     0 A1:0-1|A2:3"
-	"C0-3:P1:S+ C2-3:P1 .      .     O2=0   O2=1    .      .     0 A1:0-1|A2:2-3"
-	"C0-3:P1:S+ C2-3:P1 .      .     O1=0    .      .      .     0 A1:0|A2:2-3"
-	"C0-3:P1:S+ C2-3:P1 .      .     O1=0   O1=1    .      .     0 A1:0-1|A2:2-3"
-	"C2-3:P1:S+  C3:P1  .      .     O3=0   O3=1    .      .     0 A1:2|A2:3 A1:P1|A2:P1"
-	"C2-3:P1:S+  C3:P2  .      .     O3=0   O3=1    .      .     0 A1:2|A2:3 A1:P1|A2:P2"
-	"C2-3:P1:S+  C3:P1  .      .     O2=0   O2=1    .      .     0 A1:2|A2:3 A1:P1|A2:P1"
-	"C2-3:P1:S+  C3:P2  .      .     O2=0   O2=1    .      .     0 A1:2|A2:3 A1:P1|A2:P2"
-	"C2-3:P1:S+  C3:P1  .      .     O2=0    .      .      .     0 A1:|A2:3 A1:P1|A2:P1"
-	"C2-3:P1:S+  C3:P1  .      .     O3=0    .      .      .     0 A1:2|A2: A1:P1|A2:P1"
-	"C2-3:P1:S+  C3:P1  .      .    T:O2=0   .      .      .     0 A1:3|A2:3 A1:P1|A2:P-1"
-	"C2-3:P1:S+  C3:P1  .      .      .    T:O3=0   .      .     0 A1:2|A2:2 A1:P1|A2:P-1"
+	"   C0-1     .      .    C2-3     .    C4-5     .     O2=0   0 A1:0-1|B1:3"
+	"  C0-3:P1 C2-3:P1  .      .     O2=0    .      .      .     0 A1:0-1|A2:3"
+	"  C0-3:P1 C2-3:P1  .      .     O2=0   O2=1    .      .     0 A1:0-1|A2:2-3"
+	"  C0-3:P1 C2-3:P1  .      .     O1=0    .      .      .     0 A1:0|A2:2-3"
+	"  C0-3:P1 C2-3:P1  .      .     O1=0   O1=1    .      .     0 A1:0-1|A2:2-3"
+	"  C2-3:P1  C3:P1   .      .     O3=0   O3=1    .      .     0 A1:2|A2:3 A1:P1|A2:P1"
+	"  C2-3:P1  C3:P2   .      .     O3=0   O3=1    .      .     0 A1:2|A2:3 A1:P1|A2:P2"
+	"  C2-3:P1  C3:P1   .      .     O2=0   O2=1    .      .     0 A1:2|A2:3 A1:P1|A2:P1"
+	"  C2-3:P1  C3:P2   .      .     O2=0   O2=1    .      .     0 A1:2|A2:3 A1:P1|A2:P2"
+	"  C2-3:P1  C3:P1   .      .     O2=0    .      .      .     0 A1:|A2:3 A1:P1|A2:P1"
+	"  C2-3:P1  C3:P1   .      .     O3=0    .      .      .     0 A1:2|A2: A1:P1|A2:P1"
+	"  C2-3:P1  C3:P1   .      .    T:O2=0   .      .      .     0 A1:3|A2:3 A1:P1|A2:P-1"
+	"  C2-3:P1  C3:P1   .      .      .    T:O3=0   .      .     0 A1:2|A2:2 A1:P1|A2:P-1"
 	"$SETUP_A123_PARTITIONS    .     O1=0    .      .      .     0 A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
 	"$SETUP_A123_PARTITIONS    .     O2=0    .      .      .     0 A1:1|A2:|A3:3 A1:P1|A2:P1|A3:P1"
 	"$SETUP_A123_PARTITIONS    .     O3=0    .      .      .     0 A1:1|A2:2|A3: A1:P1|A2:P1|A3:P1"
@@ -264,88 +263,87 @@ TEST_MATRIX=(
 	#
 	# Remote partition and cpuset.cpus.exclusive tests
 	#
-	" C0-3:S+ C1-3:S+ C2-3     .    X2-3     .      .      .     0 A1:0-3|A2:1-3|A3:2-3|XA1:2-3"
-	" C0-3:S+ C1-3:S+ C2-3     .    X2-3  X2-3:P2   .      .     0 A1:0-1|A2:2-3|A3:2-3 A1:P0|A2:P2 2-3"
-	" C0-3:S+ C1-3:S+ C2-3     .    X2-3   X3:P2    .      .     0 A1:0-2|A2:3|A3:3 A1:P0|A2:P2 3"
-	" C0-3:S+ C1-3:S+ C2-3     .    X2-3   X2-3  X2-3:P2   .     0 A1:0-1|A2:1|A3:2-3 A1:P0|A3:P2 2-3"
-	" C0-3:S+ C1-3:S+ C2-3     .    X2-3   X2-3 X2-3:P2:C3 .     0 A1:0-1|A2:1|A3:2-3 A1:P0|A3:P2 2-3"
-	" C0-3:S+ C1-3:S+ C2-3   C2-3     .      .      .      P2    0 A1:0-1|A2:1|A3:1|B1:2-3 A1:P0|A3:P0|B1:P2"
-	" C0-3:S+ C1-3:S+ C2-3   C4-5     .      .      .      P2    0 B1:4-5 B1:P2 4-5"
-	" C0-3:S+ C1-3:S+ C2-3    C4    X2-3   X2-3  X2-3:P2   P2    0 A3:2-3|B1:4 A3:P2|B1:P2 2-4"
-	" C0-3:S+ C1-3:S+ C2-3    C4    X2-3   X2-3 X2-3:P2:C1-3 P2  0 A3:2-3|B1:4 A3:P2|B1:P2 2-4"
-	" C0-3:S+ C1-3:S+ C2-3    C4    X1-3  X1-3:P2   P2     .     0 A2:1|A3:2-3 A2:P2|A3:P2 1-3"
-	" C0-3:S+ C1-3:S+ C2-3    C4    X2-3   X2-3  X2-3:P2 P2:C4-5 0 A3:2-3|B1:4-5 A3:P2|B1:P2 2-5"
-	" C4:X0-3:S+ X1-3:S+ X2-3  .      .      P2     .      .     0 A1:4|A2:1-3|A3:1-3 A2:P2 1-3"
-	" C4:X0-3:S+ X1-3:S+ X2-3  .      .      .      P2     .     0 A1:4|A2:4|A3:2-3 A3:P2 2-3"
+	"   C0-3    C1-3  C2-3     .    X2-3     .      .      .     0 A1:0-3|A2:1-3|A3:2-3|XA1:2-3"
+	"   C0-3    C1-3  C2-3     .    X2-3  X2-3:P2   .      .     0 A1:0-1|A2:2-3|A3:2-3 A1:P0|A2:P2 2-3"
+	"   C0-3    C1-3  C2-3     .    X2-3   X3:P2    .      .     0 A1:0-2|A2:3|A3:3 A1:P0|A2:P2 3"
+	"   C0-3    C1-3  C2-3     .    X2-3   X2-3  X2-3:P2   .     0 A1:0-1|A2:1|A3:2-3 A1:P0|A3:P2 2-3"
+	"   C0-3    C1-3  C2-3     .    X2-3   X2-3 X2-3:P2:C3 .     0 A1:0-1|A2:1|A3:2-3 A1:P0|A3:P2 2-3"
+	"   C0-3    C1-3  C2-3   C2-3     .      .      .      P2    0 A1:0-1|A2:1|A3:1|B1:2-3 A1:P0|A3:P0|B1:P2"
+	"   C0-3    C1-3  C2-3   C4-5     .      .      .      P2    0 B1:4-5 B1:P2 4-5"
+	"   C0-3    C1-3  C2-3    C4    X2-3   X2-3  X2-3:P2   P2    0 A3:2-3|B1:4 A3:P2|B1:P2 2-4"
+	"   C0-3    C1-3  C2-3    C4    X2-3   X2-3 X2-3:P2:C1-3 P2  0 A3:2-3|B1:4 A3:P2|B1:P2 2-4"
+	"   C0-3    C1-3  C2-3    C4    X1-3  X1-3:P2   P2     .     0 A2:1|A3:2-3 A2:P2|A3:P2 1-3"
+	"   C0-3    C1-3  C2-3    C4    X2-3   X2-3  X2-3:P2 P2:C4-5 0 A3:2-3|B1:4-5 A3:P2|B1:P2 2-5"
+	"  C4:X0-3  X1-3  X2-3     .      .      P2     .      .     0 A1:4|A2:1-3|A3:1-3 A2:P2 1-3"
+	"  C4:X0-3  X1-3  X2-3     .      .      .      P2     .     0 A1:4|A2:4|A3:2-3 A3:P2 2-3"
 
 	# Nested remote/local partition tests
-	" C0-3:S+ C1-3:S+ C2-3   C4-5   X2-3  X2-3:P1   P2     P1    0 A1:0-1|A2:|A3:2-3|B1:4-5 \
+	"   C0-3    C1-3  C2-3   C4-5   X2-3  X2-3:P1   P2     P1    0 A1:0-1|A2:|A3:2-3|B1:4-5 \
 								       A1:P0|A2:P1|A3:P2|B1:P1 2-3"
-	" C0-3:S+ C1-3:S+ C2-3    C4    X2-3  X2-3:P1   P2     P1    0 A1:0-1|A2:|A3:2-3|B1:4 \
+	"   C0-3    C1-3  C2-3    C4    X2-3  X2-3:P1   P2     P1    0 A1:0-1|A2:|A3:2-3|B1:4 \
 								       A1:P0|A2:P1|A3:P2|B1:P1 2-4|2-3"
-	" C0-3:S+ C1-3:S+ C2-3    C4    X2-3  X2-3:P1    .     P1    0 A1:0-1|A2:2-3|A3:2-3|B1:4 \
+	"   C0-3    C1-3  C2-3    C4    X2-3  X2-3:P1    .     P1    0 A1:0-1|A2:2-3|A3:2-3|B1:4 \
 								       A1:P0|A2:P1|A3:P0|B1:P1"
-	" C0-3:S+ C1-3:S+  C3     C4    X2-3  X2-3:P1   P2     P1    0 A1:0-1|A2:2|A3:3|B1:4 \
+	"   C0-3    C1-3   C3     C4    X2-3  X2-3:P1   P2     P1    0 A1:0-1|A2:2|A3:3|B1:4 \
 								       A1:P0|A2:P1|A3:P2|B1:P1 2-4|3"
-	" C0-4:S+ C1-4:S+ C2-4     .    X2-4  X2-4:P2  X4:P1    .    0 A1:0-1|A2:2-3|A3:4 \
+	"   C0-4    C1-4  C2-4     .    X2-4  X2-4:P2  X4:P1    .    0 A1:0-1|A2:2-3|A3:4 \
 								       A1:P0|A2:P2|A3:P1 2-4|2-3"
-	" C0-4:S+ C1-4:S+ C2-4     .    X2-4  X2-4:P2 X3-4:P1   .    0 A1:0-1|A2:2|A3:3-4 \
+	"   C0-4    C1-4  C2-4     .    X2-4  X2-4:P2 X3-4:P1   .    0 A1:0-1|A2:2|A3:3-4 \
 								       A1:P0|A2:P2|A3:P1 2"
-	" C0-4:X2-4:S+ C1-4:X2-4:S+:P2 C2-4:X4:P1 \
+	" C0-4:X2-4 C1-4:X2-4:P2 C2-4:X4:P1 \
 				   .      .      X5      .      .    0 A1:0-4|A2:1-4|A3:2-4 \
 								       A1:P0|A2:P-2|A3:P-1 ."
-	" C0-4:X2-4:S+ C1-4:X2-4:S+:P2 C2-4:X4:P1 \
+	" C0-4:X2-4 C1-4:X2-4:P2 C2-4:X4:P1 \
 				   .      .      .      X1      .    0 A1:0-1|A2:2-4|A3:2-4 \
 								       A1:P0|A2:P2|A3:P-1 2-4"
 
 	# Remote partition offline tests
-	" C0-3:S+ C1-3:S+ C2-3     .    X2-3   X2-3 X2-3:P2:O2=0 .   0 A1:0-1|A2:1|A3:3 A1:P0|A3:P2 2-3"
-	" C0-3:S+ C1-3:S+ C2-3     .    X2-3   X2-3 X2-3:P2:O2=0 O2=1 0 A1:0-1|A2:1|A3:2-3 A1:P0|A3:P2 2-3"
-	" C0-3:S+ C1-3:S+  C3      .    X2-3   X2-3    P2:O3=0   .   0 A1:0-2|A2:1-2|A3: A1:P0|A3:P2 3"
-	" C0-3:S+ C1-3:S+  C3      .    X2-3   X2-3   T:P2:O3=0  .   0 A1:0-2|A2:1-2|A3:1-2 A1:P0|A3:P-2 3|"
+	"   C0-3    C1-3  C2-3     .    X2-3   X2-3 X2-3:P2:O2=0 .   0 A1:0-1|A2:1|A3:3 A1:P0|A3:P2 2-3"
+	"   C0-3    C1-3  C2-3     .    X2-3   X2-3 X2-3:P2:O2=0 O2=1 0 A1:0-1|A2:1|A3:2-3 A1:P0|A3:P2 2-3"
+	"   C0-3    C1-3   C3      .    X2-3   X2-3    P2:O3=0   .   0 A1:0-2|A2:1-2|A3: A1:P0|A3:P2 3"
+	"   C0-3    C1-3   C3      .    X2-3   X2-3   T:P2:O3=0  .   0 A1:0-2|A2:1-2|A3:1-2 A1:P0|A3:P-2 3|"
 
 	# An invalidated remote partition cannot self-recover from hotplug
-	" C0-3:S+ C1-3:S+  C2      .    X2-3   X2-3   T:P2:O2=0 O2=1 0 A1:0-3|A2:1-3|A3:2 A1:P0|A3:P-2 ."
+	"   C0-3    C1-3   C2      .    X2-3   X2-3   T:P2:O2=0 O2=1 0 A1:0-3|A2:1-3|A3:2 A1:P0|A3:P-2 ."
 
 	# cpus.exclusive.effective clearing test
-	" C0-3:S+ C1-3:S+  C2      .   X2-3:X    .      .      .     0 A1:0-3|A2:1-3|A3:2|XA1:"
+	"   C0-3    C1-3   C2      .   X2-3:X    .      .      .     0 A1:0-3|A2:1-3|A3:2|XA1:"
 
 	# Invalid to valid remote partition transition test
-	" C0-3:S+   C1-3    .      .      .    X3:P2    .      .     0 A1:0-3|A2:1-3|XA2: A2:P-2 ."
-	" C0-3:S+ C1-3:X3:P2
-			    .      .    X2-3    P2      .      .     0 A1:0-2|A2:3|XA2:3 A2:P2 3"
+	"   C0-3    C1-3    .      .      .    X3:P2    .      .     0 A1:0-3|A2:1-3|XA2: A2:P-2 ."
+	"   C0-3 C1-3:X3:P2 .      .    X2-3    P2      .      .     0 A1:0-2|A2:3|XA2:3 A2:P2 3"
 
 	# Invalid to valid local partition direct transition tests
-	" C1-3:S+:P2 X4:P2  .      .      .      .      .      .     0 A1:1-3|XA1:1-3|A2:1-3:XA2: A1:P2|A2:P-2 1-3"
-	" C1-3:S+:P2 X4:P2  .      .      .    X3:P2    .      .     0 A1:1-2|XA1:1-3|A2:3:XA2:3 A1:P2|A2:P2 1-3"
-	"  C0-3:P2   .      .    C4-6   C0-4     .      .      .     0 A1:0-4|B1:5-6 A1:P2|B1:P0"
-	"  C0-3:P2   .      .    C4-6 C0-4:C0-3  .      .      .     0 A1:0-3|B1:4-6 A1:P2|B1:P0 0-3"
+	" C1-3:P2  X4:P2    .      .      .      .      .      .     0 A1:1-3|XA1:1-3|A2:1-3:XA2: A1:P2|A2:P-2 1-3"
+	" C1-3:P2  X4:P2    .      .      .    X3:P2    .      .     0 A1:1-2|XA1:1-3|A2:3:XA2:3 A1:P2|A2:P2 1-3"
+	" C0-3:P2    .      .    C4-6   C0-4     .      .      .     0 A1:0-4|B1:5-6 A1:P2|B1:P0"
+	" C0-3:P2    .      .    C4-6 C0-4:C0-3  .      .      .     0 A1:0-3|B1:4-6 A1:P2|B1:P0 0-3"
 
 	# Local partition invalidation tests
-	" C0-3:X1-3:S+:P2 C1-3:X2-3:S+:P2 C2-3:X3:P2 \
+	" C0-3:X1-3:P2 C1-3:X2-3:P2 C2-3:X3:P2 \
 				   .      .      .      .      .     0 A1:1|A2:2|A3:3 A1:P2|A2:P2|A3:P2 1-3"
-	" C0-3:X1-3:S+:P2 C1-3:X2-3:S+:P2 C2-3:X3:P2 \
+	" C0-3:X1-3:P2 C1-3:X2-3:P2 C2-3:X3:P2 \
 				   .      .     X4      .      .     0 A1:1-3|A2:1-3|A3:2-3|XA2:|XA3: A1:P2|A2:P-2|A3:P-2 1-3"
-	" C0-3:X1-3:S+:P2 C1-3:X2-3:S+:P2 C2-3:X3:P2 \
+	" C0-3:X1-3:P2 C1-3:X2-3:P2 C2-3:X3:P2 \
 				   .      .    C4:X     .      .     0 A1:1-3|A2:1-3|A3:2-3|XA2:|XA3: A1:P2|A2:P-2|A3:P-2 1-3"
 	# Local partition CPU change tests
-	" C0-5:S+:P2 C4-5:S+:P1 .  .      .    C3-5     .      .     0 A1:0-2|A2:3-5 A1:P2|A2:P1 0-2"
-	" C0-5:S+:P2 C4-5:S+:P1 .  .    C1-5     .      .      .     0 A1:1-3|A2:4-5 A1:P2|A2:P1 1-3"
+	" C0-5:P2  C4-5:P1  .      .      .    C3-5     .      .     0 A1:0-2|A2:3-5 A1:P2|A2:P1 0-2"
+	" C0-5:P2  C4-5:P1  .      .    C1-5     .      .      .     0 A1:1-3|A2:4-5 A1:P2|A2:P1 1-3"
 
 	# cpus_allowed/exclusive_cpus update tests
-	" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3 \
+	" C0-3:X2-3 C1-3:X2-3 C2-3:X2-3 \
 				   .    X:C4     .      P2     .     0 A1:4|A2:4|XA2:|XA3:|A3:4 \
 								       A1:P0|A3:P-2 ."
-	" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3 \
+	" C0-3:X2-3 C1-3:X2-3 C2-3:X2-3 \
 				   .     X1      .      P2     .     0 A1:0-3|A2:1-3|XA1:1|XA2:|XA3:|A3:2-3 \
 								       A1:P0|A3:P-2 ."
-	" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3 \
+	" C0-3:X2-3 C1-3:X2-3 C2-3:X2-3 \
 				   .      .     X3      P2     .     0 A1:0-2|A2:1-2|XA2:3|XA3:3|A3:3 \
 								       A1:P0|A3:P2 3"
-	" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3:P2 \
+	" C0-3:X2-3 C1-3:X2-3 C2-3:X2-3:P2 \
 				   .      .     X3      .      .     0 A1:0-2|A2:1-2|XA2:3|XA3:3|A3:3|XA3:3 \
 								       A1:P0|A3:P2 3"
-	" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3:P2 \
+	" C0-3:X2-3 C1-3:X2-3 C2-3:X2-3:P2 \
 				   .     X4      .      .      .     0 A1:0-3|A2:1-3|A3:2-3|XA1:4|XA2:|XA3 \
 								       A1:P0|A3:P-2"
 
@@ -356,37 +354,37 @@ TEST_MATRIX=(
 	#
 	# Adding CPUs to partition root that are not in parent's
 	# cpuset.cpus is allowed, but those extra CPUs are ignored.
-	"C2-3:P1:S+ C3:P1   .      .      .     C2-4    .      .     0 A1:|A2:2-3 A1:P1|A2:P1"
+	"  C2-3:P1   C3:P1  .      .      .     C2-4    .      .     0 A1:|A2:2-3 A1:P1|A2:P1"
 
 	# Taking away all CPUs from parent or itself if there are tasks
 	# will make the partition invalid.
-	"C2-3:P1:S+  C3:P1  .      .      T     C2-3    .      .     0 A1:2-3|A2:2-3 A1:P1|A2:P-1"
-	" C3:P1:S+    C3    .      .      T      P1     .      .     0 A1:3|A2:3 A1:P1|A2:P-1"
+	"  C2-3:P1   C3:P1  .      .      T     C2-3    .      .     0 A1:2-3|A2:2-3 A1:P1|A2:P-1"
+	"   C3:P1     C3    .      .      T      P1     .      .     0 A1:3|A2:3 A1:P1|A2:P-1"
 	"$SETUP_A123_PARTITIONS    .    T:C2-3   .      .      .     0 A1:2-3|A2:2-3|A3:3 A1:P1|A2:P-1|A3:P-1"
 	"$SETUP_A123_PARTITIONS    . T:C2-3:C1-3 .      .      .     0 A1:1|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
 
 	# Changing a partition root to member makes child partitions invalid
-	"C2-3:P1:S+  C3:P1  .      .      P0     .      .      .     0 A1:2-3|A2:3 A1:P0|A2:P-1"
+	"  C2-3:P1   C3:P1  .      .      P0     .      .      .     0 A1:2-3|A2:3 A1:P0|A2:P-1"
 	"$SETUP_A123_PARTITIONS    .     C2-3    P0     .      .     0 A1:2-3|A2:2-3|A3:3 A1:P1|A2:P0|A3:P-1"
 
 	# cpuset.cpus can contains cpus not in parent's cpuset.cpus as long
 	# as they overlap.
-	"C2-3:P1:S+  .      .      .      .   C3-4:P1   .      .     0 A1:2|A2:3 A1:P1|A2:P1"
+	"  C2-3:P1   .      .      .      .   C3-4:P1   .      .     0 A1:2|A2:3 A1:P1|A2:P1"
 
 	# Deletion of CPUs distributed to child cgroup is allowed.
-	"C0-1:P1:S+ C1      .    C2-3   C4-5     .      .      .     0 A1:4-5|A2:4-5"
+	"  C0-1:P1  C1      .    C2-3   C4-5     .      .      .     0 A1:4-5|A2:4-5"
 
 	# To become a valid partition root, cpuset.cpus must overlap parent's
 	# cpuset.cpus.
-	"  C0-1:P1   .      .    C2-3    S+   C4-5:P1   .      .     0 A1:0-1|A2:0-1 A1:P1|A2:P-1"
+	"  C0-1:P1   .      .    C2-3     .   C4-5:P1   .      .     0 A1:0-1|A2:0-1 A1:P1|A2:P-1"
 
 	# Enabling partition with child cpusets is allowed
-	"  C0-1:S+  C1      .    C2-3    P1      .      .      .     0 A1:0-1|A2:1 A1:P1"
+	"   C0-1    C1      .    C2-3    P1      .      .      .     0 A1:0-1|A2:1 A1:P1"
 
 	# A partition root with non-partition root parent is invalid| but it
 	# can be made valid if its parent becomes a partition root too.
-	"  C0-1:S+  C1      .    C2-3     .      P2     .      .     0 A1:0-1|A2:1 A1:P0|A2:P-2"
-	"  C0-1:S+ C1:P2    .    C2-3     P1     .      .      .     0 A1:0|A2:1 A1:P1|A2:P2 0-1|1"
+	"   C0-1    C1      .    C2-3     .      P2     .      .     0 A1:0-1|A2:1 A1:P0|A2:P-2"
+	"   C0-1   C1:P2    .    C2-3     P1     .      .      .     0 A1:0|A2:1 A1:P1|A2:P2 0-1|1"
 
 	# A non-exclusive cpuset.cpus change will not invalidate its siblings partition.
 	"  C0-1:P1   .      .    C2-3   C0-2     .      .      .     0 A1:0-2|B1:3 A1:P1|B1:P0"
@@ -398,23 +396,23 @@ TEST_MATRIX=(
 
 	# Child partition root that try to take all CPUs from parent partition
 	# with tasks will remain invalid.
-	" C1-4:P1:S+ P1     .      .       .     .      .      .     0 A1:1-4|A2:1-4 A1:P1|A2:P-1"
-	" C1-4:P1:S+ P1     .      .       .   C1-4     .      .     0 A1|A2:1-4 A1:P1|A2:P1"
-	" C1-4:P1:S+ P1     .      .       T   C1-4     .      .     0 A1:1-4|A2:1-4 A1:P1|A2:P-1"
+	"  C1-4:P1  P1      .      .       .     .      .      .     0 A1:1-4|A2:1-4 A1:P1|A2:P-1"
+	"  C1-4:P1  P1      .      .       .   C1-4     .      .     0 A1|A2:1-4 A1:P1|A2:P1"
+	"  C1-4:P1  P1      .      .       T   C1-4     .      .     0 A1:1-4|A2:1-4 A1:P1|A2:P-1"
 
 	# Clearing of cpuset.cpus with a preset cpuset.cpus.exclusive shouldn't
 	# affect cpuset.cpus.exclusive.effective.
-	" C1-4:X3:S+ C1:X3  .      .       .     C      .      .     0 A2:1-4|XA2:3"
+	"  C1-4:X3 C1:X3    .      .       .     C      .      .     0 A2:1-4|XA2:3"
 
 	# cpuset.cpus can contain CPUs that overlap a sibling cpuset with cpus.exclusive
 	# but creating a local partition out of it is not allowed. Similarly and change
 	# in cpuset.cpus of a local partition that overlaps sibling exclusive CPUs will
 	# invalidate it.
-	" CX1-4:S+ CX2-4:P2 .    C5-6      .     .      .      P1    0 A1:1|A2:2-4|B1:5-6|XB1:5-6 \
+	"  CX1-4  CX2-4:P2  .    C5-6      .     .      .      P1    0 A1:1|A2:2-4|B1:5-6|XB1:5-6 \
 								       A1:P0|A2:P2:B1:P1 2-4"
-	" CX1-4:S+ CX2-4:P2 .    C3-6      .     .      .      P1    0 A1:1|A2:2-4|B1:5-6 \
+	"  CX1-4  CX2-4:P2  .    C3-6      .     .      .      P1    0 A1:1|A2:2-4|B1:5-6 \
 								       A1:P0|A2:P2:B1:P-1 2-4"
-	" CX1-4:S+ CX2-4:P2 .    C5-6      .     .      .   P1:C3-6  0 A1:1|A2:2-4|B1:5-6 \
+	"  CX1-4  CX2-4:P2  .    C5-6      .     .      .   P1:C3-6  0 A1:1|A2:2-4|B1:5-6 \
 								       A1:P0|A2:P2:B1:P-1 2-4"
 
 	# When multiple partitions with conflicting cpuset.cpus are created, the
@@ -426,14 +424,14 @@ TEST_MATRIX=(
 	" C1-3:X1-3  .      .    C4-5      .     .      .     C1-2   0 A1:1-3|B1:1-2"
 
 	# cpuset.cpus can become empty with task in it as it inherits parent's effective CPUs
-	" C1-3:S+   C2      .      .       .    T:C     .      .     0 A1:1-3|A2:1-3"
+	"   C1-3    C2      .      .       .    T:C     .      .     0 A1:1-3|A2:1-3"
 
 	#  old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS
 	#  ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ --------
 	# Failure cases:
 
 	# A task cannot be added to a partition with no cpu
-	"C2-3:P1:S+  C3:P1  .      .    O2=0:T   .      .      .     1 A1:|A2:3 A1:P1|A2:P1"
+	"  C2-3:P1 C3:P1    .      .    O2=0:T   .      .      .     1 A1:|A2:3 A1:P1|A2:P1"
 
 	# Changes to cpuset.cpus.exclusive that violate exclusivity rule is rejected
 	"   C0-3     .      .    C4-5   X0-3     .      .     X3-5   1 A1:0-3|B1:4-5"
@@ -465,31 +463,31 @@ REMOTE_TEST_MATRIX=(
 	#  old-p1 old-p2 old-c11 old-c12 old-c21 old-c22
 	#  new-p1 new-p2 new-c11 new-c12 new-c21 new-c22 ECPUs Pstate ISOLCPUS
 	#  ------ ------ ------- ------- ------- ------- ----- ------ --------
-	" X1-3:S+ X4-6:S+ X1-2     X3     X4-5     X6 \
+	"   X1-3   X4-6  X1-2      X3     X4-5     X6 \
 	      .      .     P2      P2      P2      P2    c11:1-2|c12:3|c21:4-5|c22:6 \
 							 c11:P2|c12:P2|c21:P2|c22:P2 1-6"
-	" CX1-4:S+   .   X1-2:P2   C3      .       .  \
+	"  CX1-4     .  X1-2:P2    C3      .       .  \
 	      .      .     .      C3-4     .       .     p1:3-4|c11:1-2|c12:3-4 \
 							 p1:P0|c11:P2|c12:P0 1-2"
-	" CX1-4:S+   .   X1-2:P2   .       .       .  \
+	"  CX1-4     .  X1-2:P2    .       .       .  \
 	    X2-4     .     .       .       .       .     p1:1,3-4|c11:2 \
 							 p1:P0|c11:P2 2"
-	" CX1-5:S+   .   X1-2:P2 X3-5:P1   .       .  \
+	"  CX1-5     .  X1-2:P2  X3-5:P1   .       .  \
 	    X2-4     .     .       .       .       .     p1:1,5|c11:2|c12:3-4 \
 							 p1:P0|c11:P2|c12:P1 2"
-	" CX1-4:S+   .   X1-2:P2 X3-4:P1   .       .  \
+	"  CX1-4     .  X1-2:P2  X3-4:P1   .       .  \
 	      .      .     X2      .       .       .     p1:1|c11:2|c12:3-4 \
 							 p1:P0|c11:P2|c12:P1 2"
 	# p1 as member, will get its effective CPUs from its parent rtest
-	" CX1-4:S+   .   X1-2:P2 X3-4:P1   .       .  \
+	"  CX1-4     .  X1-2:P2  X3-4:P1   .       .  \
 	      .      .     X1     CX2-4    .       .     p1:5-7|c11:1|c12:2-4 \
 							 p1:P0|c11:P2|c12:P1 1"
-	" CX1-4:S+ X5-6:P1:S+ .    .       .       .  \
-	      .      .   X1-2:P2  X4-5:P1  .     X1-7:P2 p1:3|c11:1-2|c12:4:c22:5-6 \
+	"  CX1-4  X5-6:P1  .       .       .       .  \
+	      .      .  X1-2:P2  X4-5:P1   .     X1-7:P2 p1:3|c11:1-2|c12:4:c22:5-6 \
 							 p1:P0|p2:P1|c11:P2|c12:P1|c22:P2 \
 							 1-2,4-6|1-2,5-6"
 	# c12 whose cpuset.cpus CPUs are all granted to c11 will become invalid partition
-	" C1-5:P1:S+ .  C1-4:P1   C2-3     .       .  \
+	"  C1-5:P1   .  C1-4:P1   C2-3     .       .  \
 	      .      .     .       P1      .       .     p1:5|c11:1-4|c12:5 \
 							 p1:P1|c11:P1|c12:P-1"
 )
@@ -530,7 +528,6 @@ set_ctrl_state()
 	CGRP=$1
 	STATE=$2
 	SHOWERR=${3}
-	CTRL=${CTRL:=$CONTROLLER}
 	HASERR=0
 	REDIRECT="2> $TMPMSG"
 	[[ -z "$STATE" || "$STATE" = '.' ]] && return 0
@@ -540,15 +537,16 @@ set_ctrl_state()
 	for CMD in $(echo $STATE | sed -e "s/:/ /g")
 	do
 		TFILE=$CGRP/cgroup.procs
-		SFILE=$CGRP/cgroup.subtree_control
 		PFILE=$CGRP/cpuset.cpus.partition
 		CFILE=$CGRP/cpuset.cpus
 		XFILE=$CGRP/cpuset.cpus.exclusive
-		case $CMD in
-		    S*) PREFIX=${CMD#?}
-			COMM="echo ${PREFIX}${CTRL} > $SFILE"
+
+		# Enable cpuset controller if not enabled yet
+		[[ -f $CFILE ]] || {
+			COMM="echo +cpuset > $CGRP/../cgroup.subtree_control"
 			eval $COMM $REDIRECT
-			;;
+		}
+		case $CMD in
 		    X*)
 			CPUS=${CMD#?}
 			COMM="echo $CPUS > $XFILE"
@@ -947,7 +945,6 @@ check_test_results()
 run_state_test()
 {
 	TEST=$1
-	CONTROLLER=cpuset
 	CGROUP_LIST=". A1 A1/A2 A1/A2/A3 B1"
 	RESET_LIST="A1/A2/A3 A1/A2 A1 B1"
 	I=0
@@ -1003,7 +1000,6 @@ run_state_test()
 run_remote_state_test()
 {
 	TEST=$1
-	CONTROLLER=cpuset
 	[[ -d rtest ]] || mkdir rtest
 	cd rtest
 	echo +cpuset > cgroup.subtree_control
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v6 6/8] cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together
  2026-02-21 18:54 [PATCH v6 0/8] cgroup/cpuset: Fix partition related locking issues Waiman Long
                   ` (4 preceding siblings ...)
  2026-02-21 18:54 ` [PATCH v6 5/8] kselftest/cgroup: Simplify test_cpuset_prs.sh by removing "S+" command Waiman Long
@ 2026-02-21 18:54 ` Waiman Long
  2026-02-26 15:51   ` Frederic Weisbecker
  2026-02-21 18:54 ` [PATCH v6 7/8] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue Waiman Long
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2026-02-21 18:54 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Frederic Weisbecker, Thomas Gleixner, Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long

With the latest changes in sched/isolation.c, rebuild_sched_domains*()
requires the HK_TYPE_DOMAIN housekeeping cpumask to be properly
updated, if needed, before the sched domains can be rebuilt. So the
two naturally fit together. Do that by creating a new
update_hk_sched_domains() helper to house both actions.

The name of the isolated_cpus_updating flag that controls the
call to housekeeping_update() is now outdated. So change it to
update_housekeeping to better reflect its purpose. Also move the call
to update_hk_sched_domains() to the end of cpuset and hotplug operations
before releasing the cpuset_mutex.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 51 ++++++++++++++++++++----------------------
 1 file changed, 24 insertions(+), 27 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 05adf6697030..3d0d18bf182f 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -130,10 +130,9 @@ static cpumask_var_t	subpartitions_cpus;	/* RWCS */
 static cpumask_var_t	isolated_cpus;		/* CSCB */
 
 /*
- * Set if isolated_cpus is being updated in the current cpuset_mutex
- * critical section.
+ * Set if housekeeping cpumasks are to be updated.
  */
-static bool		isolated_cpus_updating;	/* RWCS */
+static bool		update_housekeeping;	/* RWCS */
 
 /*
  * A flag to force sched domain rebuild at the end of an operation.
@@ -1188,7 +1187,7 @@ static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus
 			return;
 		cpumask_andnot(isolated_cpus, isolated_cpus, xcpus);
 	}
-	isolated_cpus_updating = true;
+	update_housekeeping = true;
 }
 
 /*
@@ -1306,22 +1305,22 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
 }
 
 /*
- * update_isolation_cpumasks - Update external isolation related CPU masks
+ * update_hk_sched_domains - Update HK cpumasks & rebuild sched domains
  *
- * The following external CPU masks will be updated if necessary:
- * - workqueue unbound cpumask
+ * Update housekeeping cpumasks and rebuild sched domains if necessary.
+ * This should be called at the end of cpuset or hotplug actions.
  */
-static void update_isolation_cpumasks(void)
+static void update_hk_sched_domains(void)
 {
-	int ret;
-
-	if (!isolated_cpus_updating)
-		return;
-
-	ret = housekeeping_update(isolated_cpus);
-	WARN_ON_ONCE(ret < 0);
-
-	isolated_cpus_updating = false;
+	if (update_housekeeping) {
+		/* Updating HK cpumasks implies rebuild sched domains */
+		WARN_ON_ONCE(housekeeping_update(isolated_cpus));
+		update_housekeeping = false;
+		force_sd_rebuild = true;
+	}
+	/* force_sd_rebuild will be cleared in rebuild_sched_domains_locked() */
+	if (force_sd_rebuild)
+		rebuild_sched_domains_locked();
 }
 
 /**
@@ -1472,7 +1471,6 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
 	cs->remote_partition = true;
 	cpumask_copy(cs->effective_xcpus, tmp->new_cpus);
 	spin_unlock_irq(&callback_lock);
-	update_isolation_cpumasks();
 	cpuset_force_rebuild();
 	cs->prs_err = 0;
 
@@ -1517,7 +1515,6 @@ static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
 	compute_excpus(cs, cs->effective_xcpus);
 	reset_partition_data(cs);
 	spin_unlock_irq(&callback_lock);
-	update_isolation_cpumasks();
 	cpuset_force_rebuild();
 
 	/*
@@ -1588,7 +1585,6 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
 	if (xcpus)
 		cpumask_copy(cs->exclusive_cpus, xcpus);
 	spin_unlock_irq(&callback_lock);
-	update_isolation_cpumasks();
 	if (adding || deleting)
 		cpuset_force_rebuild();
 
@@ -1932,7 +1928,6 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 		partition_xcpus_add(new_prs, parent, tmp->delmask);
 
 	spin_unlock_irq(&callback_lock);
-	update_isolation_cpumasks();
 
 	if ((old_prs != new_prs) && (cmd == partcmd_update))
 		update_partition_exclusive_flag(cs, new_prs);
@@ -2900,7 +2895,6 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	else if (isolcpus_updated)
 		isolated_cpus_update(old_prs, new_prs, cs->effective_xcpus);
 	spin_unlock_irq(&callback_lock);
-	update_isolation_cpumasks();
 
 	/* Force update if switching back to member & update effective_xcpus */
 	update_cpumasks_hier(cs, &tmpmask, !new_prs);
@@ -3190,9 +3184,8 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	}
 
 	free_cpuset(trialcs);
-	if (force_sd_rebuild)
-		rebuild_sched_domains_locked();
 out_unlock:
+	update_hk_sched_domains();
 	cpuset_full_unlock();
 	if (of_cft(of)->private == FILE_MEMLIST)
 		schedule_flush_migrate_mm();
@@ -3300,6 +3293,7 @@ static ssize_t cpuset_partition_write(struct kernfs_open_file *of, char *buf,
 	cpuset_full_lock();
 	if (is_cpuset_online(cs))
 		retval = update_prstate(cs, val);
+	update_hk_sched_domains();
 	cpuset_full_unlock();
 	return retval ?: nbytes;
 }
@@ -3474,6 +3468,7 @@ static void cpuset_css_killed(struct cgroup_subsys_state *css)
 	/* Reset valid partition back to member */
 	if (is_partition_valid(cs))
 		update_prstate(cs, PRS_MEMBER);
+	update_hk_sched_domains();
 	cpuset_full_unlock();
 }
 
@@ -3881,10 +3876,12 @@ static void cpuset_handle_hotplug(void)
 		rcu_read_unlock();
 	}
 
-	/* rebuild sched domains if necessary */
-	if (force_sd_rebuild)
-		rebuild_sched_domains_cpuslocked();
 
+	if (update_housekeeping || force_sd_rebuild) {
+		mutex_lock(&cpuset_mutex);
+		update_hk_sched_domains();
+		mutex_unlock(&cpuset_mutex);
+	}
 	free_tmpmasks(ptmp);
 }
 
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v6 7/8] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
  2026-02-21 18:54 [PATCH v6 0/8] cgroup/cpuset: Fix partition related locking issues Waiman Long
                   ` (5 preceding siblings ...)
  2026-02-21 18:54 ` [PATCH v6 6/8] cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together Waiman Long
@ 2026-02-21 18:54 ` Waiman Long
  2026-02-26 16:06   ` Frederic Weisbecker
                     ` (2 more replies)
  2026-02-21 18:54 ` [PATCH v6 8/8] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock Waiman Long
  2026-02-23 20:57 ` [PATCH v6 0/8] cgroup/cpuset: Fix partition related locking issues Tejun Heo
  8 siblings, 3 replies; 29+ messages in thread
From: Waiman Long @ 2026-02-21 18:54 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Frederic Weisbecker, Thomas Gleixner, Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long

cpuset_handle_hotplug() may need to invoke housekeeping_update(),
for instance, when an isolated partition is invalidated because its
last active CPU has been taken offline.

As we are going to enable dynamic update of the nohz_full housekeeping
cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
allowing the CPU hotplug path to call into housekeeping_update() directly
from update_isolation_cpumasks() will likely cause deadlock. So we
have to defer any call to housekeeping_update() until after the CPU
hotplug operation has finished. This is now done by queuing a work
item to a workqueue; its work function, hk_sd_workfn(), invokes
update_hk_sched_domains().

A concurrent cpuset control file write may have already executed the
required update_hk_sched_domains() call before the work function runs.
In that case, the work function becomes a no-op.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c                        | 31 ++++++++++++++++---
 .../selftests/cgroup/test_cpuset_prs.sh       | 11 ++++++-
 2 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 3d0d18bf182f..2c80bfc30bbc 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1323,6 +1323,16 @@ static void update_hk_sched_domains(void)
 		rebuild_sched_domains_locked();
 }
 
+/*
+ * Work function to invoke update_hk_sched_domains()
+ */
+static void hk_sd_workfn(struct work_struct *work)
+{
+	cpuset_full_lock();
+	update_hk_sched_domains();
+	cpuset_full_unlock();
+}
+
 /**
  * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
  * @parent: Parent cpuset containing all siblings
@@ -3795,6 +3805,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
  */
 static void cpuset_handle_hotplug(void)
 {
+	static DECLARE_WORK(hk_sd_work, hk_sd_workfn);
 	static cpumask_t new_cpus;
 	static nodemask_t new_mems;
 	bool cpus_updated, mems_updated;
@@ -3877,11 +3888,21 @@ static void cpuset_handle_hotplug(void)
 	}
 
 
-	if (update_housekeeping || force_sd_rebuild) {
-		mutex_lock(&cpuset_mutex);
-		update_hk_sched_domains();
-		mutex_unlock(&cpuset_mutex);
-	}
+	/*
+	 * Queue a work to call housekeeping_update() & rebuild_sched_domains()
+	 * There will be a slight delay before the HK_TYPE_DOMAIN housekeeping
+	 * cpumask can correctly reflect what is in isolated_cpus.
+	 *
+	 * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work item that
+	 * is still pending. Before the pending bit is cleared, the work data
+	 * is copied out and work item dequeued. So it is possible to queue
+	 * the work again before the hk_sd_workfn() is invoked to process the
+	 * previously queued work. Since hk_sd_workfn() doesn't use the work
+	 * item at all, this is not a problem.
+	 */
+	if (update_housekeeping || force_sd_rebuild)
+		queue_work(system_unbound_wq, &hk_sd_work);
+
 	free_tmpmasks(ptmp);
 }
 
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index 0c5db118f2d1..dc2dff361ec6 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -246,6 +246,9 @@ TEST_MATRIX=(
 	"  C2-3:P1  C3:P1   .      .     O3=0    .      .      .     0 A1:2|A2: A1:P1|A2:P1"
 	"  C2-3:P1  C3:P1   .      .    T:O2=0   .      .      .     0 A1:3|A2:3 A1:P1|A2:P-1"
 	"  C2-3:P1  C3:P1   .      .      .    T:O3=0   .      .     0 A1:2|A2:2 A1:P1|A2:P-1"
+	"  C2-3:P1  C3:P2   .      .    T:O2=0   .      .      .     0 A1:3|A2:3 A1:P1|A2:P-2"
+	"  C1-3:P1  C3:P2   .      .      .    T:O3=0   .      .     0 A1:1-2|A2:1-2 A1:P1|A2:P-2 3|"
+	"  C1-3:P1  C3:P2   .      .      .    T:O3=0  O3=1    .     0 A1:1-2|A2:3 A1:P1|A2:P2  3"
 	"$SETUP_A123_PARTITIONS    .     O1=0    .      .      .     0 A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
 	"$SETUP_A123_PARTITIONS    .     O2=0    .      .      .     0 A1:1|A2:|A3:3 A1:P1|A2:P1|A3:P1"
 	"$SETUP_A123_PARTITIONS    .     O3=0    .      .      .     0 A1:1|A2:2|A3: A1:P1|A2:P1|A3:P1"
@@ -762,7 +765,7 @@ check_cgroup_states()
 # only CPUs in isolated partitions as well as those that are isolated at
 # boot time.
 #
-# $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
+# $1 - expected isolated cpu list(s) <isolcpus1>{|<isolcpus2>}
 # <isolcpus1> - expected sched/domains value
 # <isolcpus2> - cpuset.cpus.isolated value = <isolcpus1> if not defined
 #
@@ -771,6 +774,7 @@ check_isolcpus()
 	EXPECTED_ISOLCPUS=$1
 	ISCPUS=${CGROUP2}/cpuset.cpus.isolated
 	ISOLCPUS=$(cat $ISCPUS)
+	HKICPUS=$(cat /sys/devices/system/cpu/isolated)
 	LASTISOLCPU=
 	SCHED_DOMAINS=/sys/kernel/debug/sched/domains
 	if [[ $EXPECTED_ISOLCPUS = . ]]
@@ -808,6 +812,11 @@ check_isolcpus()
 	ISOLCPUS=
 	EXPECTED_ISOLCPUS=$EXPECTED_SDOMAIN
 
+	#
+	# The inverse of HK_TYPE_DOMAIN cpumask in $HKICPUS should match $ISOLCPUS
+	#
+	[[ "$ISOLCPUS" != "$HKICPUS" ]] && return 1
+
 	#
 	# Use the sched domain in debugfs to check isolated CPUs, if available
 	#
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v6 8/8] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
  2026-02-21 18:54 [PATCH v6 0/8] cgroup/cpuset: Fix partition related locking issues Waiman Long
                   ` (6 preceding siblings ...)
  2026-02-21 18:54 ` [PATCH v6 7/8] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue Waiman Long
@ 2026-02-21 18:54 ` Waiman Long
  2026-03-02 12:14   ` Frederic Weisbecker
  2026-02-23 20:57 ` [PATCH v6 0/8] cgroup/cpuset: Fix partition related locking issues Tejun Heo
  8 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2026-02-21 18:54 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Frederic Weisbecker, Thomas Gleixner, Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long

The current cpuset partition code is able to dynamically update
the sched domains of a running system and the corresponding
HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
"isolcpus=domain,..." boot command line feature at run time.

The housekeeping cpumask update requires flushing a number of different
workqueues, which may not be safe with cpus_read_lock() held as the
workqueue flushing code may acquire cpus_read_lock() or locks that
have a locking dependency on cpus_read_lock() down the chain. Below
is an example of such a circular locking problem.

  ======================================================
  WARNING: possible circular locking dependency detected
  6.18.0-test+ #2 Tainted: G S
  ------------------------------------------------------
  test_cpuset_prs/10971 is trying to acquire lock:
  ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180

  but task is already holding lock:
  ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:
  -> #4 (cpuset_mutex){+.+.}-{4:4}:
  -> #3 (cpu_hotplug_lock){++++}-{0:0}:
  -> #2 (rtnl_mutex){+.+.}-{4:4}:
  -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
  -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:

  Chain exists of:
    (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex

  5 locks held by test_cpuset_prs/10971:
   #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
   #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
   #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
   #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
   #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130

  Call Trace:
   <TASK>
     :
   touch_wq_lockdep_map+0x93/0x180
   __flush_workqueue+0x111/0x10b0
   housekeeping_update+0x12d/0x2d0
   update_parent_effective_cpumask+0x595/0x2440
   update_prstate+0x89d/0xce0
   cpuset_partition_write+0xc5/0x130
   cgroup_file_write+0x1a5/0x680
   kernfs_fop_write_iter+0x3df/0x5f0
   vfs_write+0x525/0xfd0
   ksys_write+0xf9/0x1d0
   do_syscall_64+0x95/0x520
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

To avoid such a circular locking dependency problem, we have to
call housekeeping_update() without holding the cpus_read_lock() and
cpuset_mutex. The current set of wq's flushed by housekeeping_update()
may not have work functions that call cpus_read_lock() directly,
but we are likely to extend the list of wq's that are flushed in the
future. Moreover, the current set of work functions may hold locks that
have cpu_hotplug_lock down the dependency chain.

So housekeeping_update() is now called after releasing cpus_read_lock
and cpuset_mutex at the end of a cpuset operation. These two locks are
then re-acquired later before calling rebuild_sched_domains_locked().

To enable mutual exclusion between the housekeeping_update() call and
other cpuset control file write actions, a new top level cpuset_top_mutex
is introduced. This new mutex will be acquired first to allow sharing
variables used by both code paths. However, a cpuset update from CPU
hotplug can still happen in parallel with the housekeeping_update()
call, though that should be rare in a production environment.

As cpus_read_lock() is now no longer held when
tmigr_isolated_exclude_cpumask() is called, it needs to acquire it
directly.

lockdep_is_cpuset_held() is also updated to return true if either
cpuset_top_mutex or cpuset_mutex is held.

Fixes: 03ff73510169 ("cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset")
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c        | 47 +++++++++++++++++++++++++++++++----
 kernel/sched/isolation.c      |  4 +--
 kernel/time/timer_migration.c |  4 +--
 3 files changed, 44 insertions(+), 11 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2c80bfc30bbc..dbda09391b19 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -65,14 +65,28 @@ static const char * const perr_strings[] = {
  * CPUSET Locking Convention
  * -------------------------
  *
- * Below are the three global locks guarding cpuset structures in lock
+ * Below are the four global/local locks guarding cpuset structures in lock
  * acquisition order:
+ *  - cpuset_top_mutex
  *  - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
  *  - cpuset_mutex
  *  - callback_lock (raw spinlock)
  *
- * A task must hold all the three locks to modify externally visible or
- * used fields of cpusets, though some of the internally used cpuset fields
+ * As cpuset will now indirectly flush a number of different workqueues in
+ * housekeeping_update() to update housekeeping cpumasks when the set of
+ * isolated CPUs is going to be changed, it may be vulnerable to deadlock
+ * if we hold cpus_read_lock while calling into housekeeping_update().
+ *
+ * The first cpuset_top_mutex will be held except when calling into
+ * cpuset_handle_hotplug() from the CPU hotplug code where cpus_write_lock
+ * and cpuset_mutex will be held instead. The main purpose of this mutex
+ * is to prevent regular cpuset control file write actions from interfering
+ * with the call to housekeeping_update(), though CPU hotplug operation can
+ * still happen in parallel. This mutex also provides protection for some
+ * internal variables.
+ *
+ * A task must hold all the remaining three locks to modify externally visible
+ * or used fields of cpusets, though some of the internally used cpuset fields
  * and internal variables can be modified without holding callback_lock. If only
  * reliable read access of the externally used fields are needed, a task can
  * hold either cpuset_mutex or callback_lock which are exposed to other
@@ -100,6 +114,7 @@ static const char * const perr_strings[] = {
  * cpumasks and nodemasks.
  */
 
+static DEFINE_MUTEX(cpuset_top_mutex);
 static DEFINE_MUTEX(cpuset_mutex);
 
 /*
@@ -111,6 +126,8 @@ static DEFINE_MUTEX(cpuset_mutex);
  *
  * CSCB: Readable by holding either cpuset_mutex or callback_lock. Writable
  *	 by holding both cpuset_mutex and callback_lock.
+ *
+ * T:	 Read/write-able by holding the cpuset_top_mutex.
  */
 
 /*
@@ -134,6 +151,11 @@ static cpumask_var_t	isolated_cpus;		/* CSCB */
  */
 static bool		update_housekeeping;	/* RWCS */
 
+/*
+ * Copy of isolated_cpus to be passed to housekeeping_update()
+ */
+static cpumask_var_t	isolated_hk_cpus;	/* T */
+
 /*
  * A flag to force sched domain rebuild at the end of an operation.
  * It can be set in
@@ -297,6 +319,7 @@ void lockdep_assert_cpuset_lock_held(void)
  */
 void cpuset_full_lock(void)
 {
+	mutex_lock(&cpuset_top_mutex);
 	cpus_read_lock();
 	mutex_lock(&cpuset_mutex);
 }
@@ -305,12 +328,14 @@ void cpuset_full_unlock(void)
 {
 	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
+	mutex_unlock(&cpuset_top_mutex);
 }
 
 #ifdef CONFIG_LOCKDEP
 bool lockdep_is_cpuset_held(void)
 {
-	return lockdep_is_held(&cpuset_mutex);
+	return lockdep_is_held(&cpuset_mutex) ||
+	       lockdep_is_held(&cpuset_top_mutex);
 }
 #endif
 
@@ -1314,9 +1339,20 @@ static void update_hk_sched_domains(void)
 {
 	if (update_housekeeping) {
 		/* Updating HK cpumasks implies rebuild sched domains */
-		WARN_ON_ONCE(housekeeping_update(isolated_cpus));
 		update_housekeeping = false;
 		force_sd_rebuild = true;
+		cpumask_copy(isolated_hk_cpus, isolated_cpus);
+
+		/*
+		 * housekeeping_update() is now called without holding
+		 * cpus_read_lock and cpuset_mutex. Only top_cpuset_mutex
+		 * is still being held for mutual exclusion.
+		 */
+		mutex_unlock(&cpuset_mutex);
+		cpus_read_unlock();
+		WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus));
+		cpus_read_lock();
+		mutex_lock(&cpuset_mutex);
 	}
 	/* force_sd_rebuild will be cleared in rebuild_sched_domains_locked() */
 	if (force_sd_rebuild)
@@ -3634,6 +3670,7 @@ int __init cpuset_init(void)
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL));
+	BUG_ON(!zalloc_cpumask_var(&isolated_hk_cpus, GFP_KERNEL));
 
 	cpumask_setall(top_cpuset.cpus_allowed);
 	nodes_setall(top_cpuset.mems_allowed);
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 3b725d39c06e..ef152d401fe2 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -123,8 +123,6 @@ int housekeeping_update(struct cpumask *isol_mask)
 	struct cpumask *trial, *old = NULL;
 	int err;
 
-	lockdep_assert_cpus_held();
-
 	trial = kmalloc(cpumask_size(), GFP_KERNEL);
 	if (!trial)
 		return -ENOMEM;
@@ -136,7 +134,7 @@ int housekeeping_update(struct cpumask *isol_mask)
 	}
 
 	if (!housekeeping.flags)
-		static_branch_enable_cpuslocked(&housekeeping_overridden);
+		static_branch_enable(&housekeeping_overridden);
 
 	if (housekeeping.flags & HK_FLAG_DOMAIN)
 		old = housekeeping_cpumask_dereference(HK_TYPE_DOMAIN);
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 6da9cd562b20..83428aa03aef 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1559,8 +1559,6 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
 	cpumask_var_t cpumask __free(free_cpumask_var) = CPUMASK_VAR_NULL;
 	int cpu;
 
-	lockdep_assert_cpus_held();
-
 	if (!works)
 		return -ENOMEM;
 	if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
@@ -1570,6 +1568,7 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
 	 * First set previously isolated CPUs as available (unisolate).
 	 * This cpumask contains only CPUs that switched to available now.
 	 */
+	guard(cpus_read_lock)();
 	cpumask_andnot(cpumask, cpu_online_mask, exclude_cpumask);
 	cpumask_andnot(cpumask, cpumask, tmigr_available_cpumask);
 
@@ -1626,7 +1625,6 @@ static int __init tmigr_init_isolation(void)
 	cpumask_andnot(cpumask, cpu_possible_mask, housekeeping_cpumask(HK_TYPE_DOMAIN));
 
 	/* Protect against RCU torture hotplug testing */
-	guard(cpus_read_lock)();
 	return tmigr_isolated_exclude_cpumask(cpumask);
 }
 late_initcall(tmigr_init_isolation);
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 0/8] cgroup/cpuset: Fix partition related locking issues
  2026-02-21 18:54 [PATCH v6 0/8] cgroup/cpuset: Fix partition related locking issues Waiman Long
                   ` (7 preceding siblings ...)
  2026-02-21 18:54 ` [PATCH v6 8/8] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock Waiman Long
@ 2026-02-23 20:57 ` Tejun Heo
  2026-02-23 21:11   ` Waiman Long
  2026-03-02 12:21   ` Frederic Weisbecker
  8 siblings, 2 replies; 29+ messages in thread
From: Tejun Heo @ 2026-02-23 20:57 UTC (permalink / raw)
  To: Waiman Long
  Cc: Chen Ridong, Johannes Weiner, Michal Koutny, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Frederic Weisbecker,
	Thomas Gleixner, Shuah Khan, cgroups, linux-kernel,
	linux-kselftest

Hello,

> Waiman Long (8):
>   cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del()
>   cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in update_cpumasks_hier()
>   cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
>   cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed
>   kselftest/cgroup: Simplify test_cpuset_prs.sh by removing "S+" command
>   cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together
>   cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
>   cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock

Applied 1-8 to cgroup/for-7.0-fixes with the following minor fixups:

- 5/8: Removed a duplicate test entry that resulted from the "S+"
  removal (two previously-different lines becoming identical).

- 8/8: Fixed typos in commit message ("essentally" -> "essentially",
  "beforce" -> "before") and code comment ("top_cpuset_mutex" ->
  "cpuset_top_mutex").

This has gone through more than enough iterations. We can resolve further
issues if there's any incrementally.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 0/8] cgroup/cpuset: Fix partition related locking issues
  2026-02-23 20:57 ` [PATCH v6 0/8] cgroup/cpuset: Fix partition related locking issues Tejun Heo
@ 2026-02-23 21:11   ` Waiman Long
  2026-02-24  7:51     ` Chen Ridong
  2026-03-02 12:21   ` Frederic Weisbecker
  1 sibling, 1 reply; 29+ messages in thread
From: Waiman Long @ 2026-02-23 21:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Chen Ridong, Johannes Weiner, Michal Koutny, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Frederic Weisbecker,
	Thomas Gleixner, Shuah Khan, cgroups, linux-kernel,
	linux-kselftest


On 2/23/26 3:57 PM, Tejun Heo wrote:
> Hello,
>
>> Waiman Long (8):
>>    cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del()
>>    cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in update_cpumasks_hier()
>>    cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
>>    cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed
>>    kselftest/cgroup: Simplify test_cpuset_prs.sh by removing "S+" command
>>    cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together
>>    cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
>>    cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
> Applied 1-8 to cgroup/for-7.0-fixes with the following minor fixups:
>
> - 5/8: Removed a duplicate test entry that resulted from the "S+"
>    removal (two previously-different lines becoming identical).
>
> - 8/8: Fixed typos in commit message ("essentally" -> "essentially",
>    "beforce" -> "before") and code comment ("top_cpuset_mutex" ->
>    "cpuset_top_mutex").
>
> This has gone through more than enough iterations. We can resolve further
> issues if there's any incrementally.

Thanks for fixing the errors.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 0/8] cgroup/cpuset: Fix partition related locking issues
  2026-02-23 21:11   ` Waiman Long
@ 2026-02-24  7:51     ` Chen Ridong
  0 siblings, 0 replies; 29+ messages in thread
From: Chen Ridong @ 2026-02-24  7:51 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo
  Cc: Johannes Weiner, Michal Koutny, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Frederic Weisbecker,
	Thomas Gleixner, Shuah Khan, cgroups, linux-kernel,
	linux-kselftest



On 2026/2/24 5:11, Waiman Long wrote:
> 
> On 2/23/26 3:57 PM, Tejun Heo wrote:
>> Hello,
>>
>>> Waiman Long (8):
>>>    cgroup/cpuset: Fix incorrect change to effective_xcpus in
>>> partition_xcpus_del()
>>>    cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in
>>> update_cpumasks_hier()
>>>    cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
>>>    cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed
>>>    kselftest/cgroup: Simplify test_cpuset_prs.sh by removing "S+" command
>>>    cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together
>>>    cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to
>>> workqueue
>>>    cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
>> Applied 1-8 to cgroup/for-7.0-fixes with the following minor fixups:
>>
>> - 5/8: Removed a duplicate test entry that resulted from the "S+"
>>    removal (two previously-different lines becoming identical).
>>
>> - 8/8: Fixed typos in commit message ("essentally" -> "essentially",
>>    "beforce" -> "before") and code comment ("top_cpuset_mutex" ->
>>    "cpuset_top_mutex").
>>
>> This has gone through more than enough iterations. We can resolve further
>> issues if there's any incrementally.
> 
> Thanks for fixing the errors.
> 
> Cheers,
> Longman
> 

This series looks good to me, it's much clearer now.

Thanks.

-- 
Best regards,
Ridong


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 3/8] cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
  2026-02-21 18:54 ` [PATCH v6 3/8] cgroup/cpuset: Clarify exclusion rules for cpuset internal variables Waiman Long
@ 2026-02-26 15:00   ` Frederic Weisbecker
  0 siblings, 0 replies; 29+ messages in thread
From: Frederic Weisbecker @ 2026-02-26 15:00 UTC (permalink / raw)
  To: Waiman Long
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Thomas Gleixner, Shuah Khan, cgroups, linux-kernel,
	linux-kselftest

Le Sat, Feb 21, 2026 at 01:54:13PM -0500, Waiman Long a écrit :
> Clarify the locking rules associated with file level internal variables
> inside the cpuset code. There is no functional change.
> 
> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
> Signed-off-by: Waiman Long <longman@redhat.com>

Acked-by: Frederic Weisbecker <frederic@kernel.org>

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 4/8] cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed
  2026-02-21 18:54 ` [PATCH v6 4/8] cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed Waiman Long
@ 2026-02-26 15:07   ` Frederic Weisbecker
  0 siblings, 0 replies; 29+ messages in thread
From: Frederic Weisbecker @ 2026-02-26 15:07 UTC (permalink / raw)
  To: Waiman Long
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Thomas Gleixner, Shuah Khan, cgroups, linux-kernel,
	linux-kselftest

Le Sat, Feb 21, 2026 at 01:54:14PM -0500, Waiman Long a écrit :
> Since cpuset now updates the HK_TYPE_DOMAIN housekeeping mask when there
> is a change in the set of isolated CPUs, making such a change is more
> costly than before. Right now, the isolated_cpus_updating flag can be
> set even if there is no real change in isolated_cpus. Put in additional
> checks to make sure that isolated_cpus_updating is set only if there
> is a real change in isolated_cpus.
> 
> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
> Signed-off-by: Waiman Long <longman@redhat.com>

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 6/8] cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together
  2026-02-21 18:54 ` [PATCH v6 6/8] cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together Waiman Long
@ 2026-02-26 15:51   ` Frederic Weisbecker
  0 siblings, 0 replies; 29+ messages in thread
From: Frederic Weisbecker @ 2026-02-26 15:51 UTC (permalink / raw)
  To: Waiman Long
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Thomas Gleixner, Shuah Khan, cgroups, linux-kernel,
	linux-kselftest

Le Sat, Feb 21, 2026 at 01:54:16PM -0500, Waiman Long a écrit :
> With the latest changes in sched/isolation.c, rebuild_sched_domains*()
> requires the HK_TYPE_DOMAIN housekeeping cpumask to be properly
> updated first, if needed, before the sched domains can be
> rebuilt. So the two naturally fit together. Do that by creating a new
> update_hk_sched_domains() helper to house both actions.
> 
> The name of the isolated_cpus_updating flag to control the
> call to housekeeping_update() is now outdated. So change it to
> update_housekeeping to better reflect its purpose. Also move the call
> to update_hk_sched_domains() to the end of cpuset and hotplug operations
> before releasing the cpuset_mutex.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

Acked-by: Frederic Weisbecker <frederic@kernel.org>

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 7/8] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
  2026-02-21 18:54 ` [PATCH v6 7/8] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue Waiman Long
@ 2026-02-26 16:06   ` Frederic Weisbecker
  2026-03-03 16:00     ` Waiman Long
  2026-03-02 11:49   ` Frederic Weisbecker
  2026-03-03 15:18   ` Jon Hunter
  2 siblings, 1 reply; 29+ messages in thread
From: Frederic Weisbecker @ 2026-02-26 16:06 UTC (permalink / raw)
  To: Waiman Long
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Thomas Gleixner, Shuah Khan, cgroups, linux-kernel,
	linux-kselftest

Le Sat, Feb 21, 2026 at 01:54:17PM -0500, Waiman Long a écrit :
> The cpuset_handle_hotplug() may need to invoke housekeeping_update(),
> for instance, when an isolated partition is invalidated because its
> last active CPU has been put offline.
> 
> As we are going to enable dynamic update to the nohz_full housekeeping
> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
> allowing the CPU hotplug path to call into housekeeping_update() directly
> from update_isolation_cpumasks() will likely cause deadlock.

I am a bit confused here. Why would CPU hotplug path need to call
update_isolation_cpumasks() -> housekeeping_update() for
HK_TYPE_KERNEL_NOISE?

> So we
> have to defer any call to housekeeping_update() after the CPU hotplug
> operation has finished. This is now done via the workqueue where
> the update_hk_sched_domains() function will be invoked via the
> hk_sd_workfn().
> 
> A concurrent cpuset control file write may have executed the required
> update_hk_sched_domains() function before the work function is called. So
> the work function call may become a no-op when it is invoked.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/cgroup/cpuset.c                        | 31 ++++++++++++++++---
>  .../selftests/cgroup/test_cpuset_prs.sh       | 11 ++++++-
>  2 files changed, 36 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 3d0d18bf182f..2c80bfc30bbc 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1323,6 +1323,16 @@ static void update_hk_sched_domains(void)
>  		rebuild_sched_domains_locked();
>  }
>  
> +/*
> + * Work function to invoke update_hk_sched_domains()
> + */
> +static void hk_sd_workfn(struct work_struct *work)
> +{
> +	cpuset_full_lock();
> +	update_hk_sched_domains();
> +	cpuset_full_unlock();
> +}
> +
>  /**
>   * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
>   * @parent: Parent cpuset containing all siblings
> @@ -3795,6 +3805,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
>   */
>  static void cpuset_handle_hotplug(void)
>  {
> +	static DECLARE_WORK(hk_sd_work, hk_sd_workfn);
>  	static cpumask_t new_cpus;
>  	static nodemask_t new_mems;
>  	bool cpus_updated, mems_updated;
> @@ -3877,11 +3888,21 @@ static void cpuset_handle_hotplug(void)
>  	}
>  
>  
> -	if (update_housekeeping || force_sd_rebuild) {
> -		mutex_lock(&cpuset_mutex);
> -		update_hk_sched_domains();
> -		mutex_unlock(&cpuset_mutex);
> -	}
> +	/*
> +	 * Queue a work to call housekeeping_update() & rebuild_sched_domains().
> +	 * There will be a slight delay before the HK_TYPE_DOMAIN housekeeping
> +	 * cpumask can correctly reflect what is in isolated_cpus.
> +	 *
> +	 * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work item that
> +	 * is still pending. Before the pending bit is cleared, the work data
> +	 * is copied out and work item dequeued. So it is possible to queue
> +	 * the work again before the hk_sd_workfn() is invoked to process the
> +	 * previously queued work. Since hk_sd_workfn() doesn't use the work
> +	 * item at all, this is not a problem.
> +	 */
> +	if (update_housekeeping || force_sd_rebuild)
> +		queue_work(system_unbound_wq, &hk_sd_work);

Nit about recent wq renames:

s/system_unbound_wq/system_dfl_wq

But what makes sure this work is executed by the end of the hotplug operations?
Is there a risk for a stale hierarchy to be observed when it shouldn't? Or a
stale housekeeping cpumask?

Thanks.

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 7/8] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
  2026-02-21 18:54 ` [PATCH v6 7/8] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue Waiman Long
  2026-02-26 16:06   ` Frederic Weisbecker
@ 2026-03-02 11:49   ` Frederic Weisbecker
  2026-03-03 15:18   ` Jon Hunter
  2 siblings, 0 replies; 29+ messages in thread
From: Frederic Weisbecker @ 2026-03-02 11:49 UTC (permalink / raw)
  To: Waiman Long
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Thomas Gleixner, Shuah Khan, cgroups, linux-kernel,
	linux-kselftest

On Sat, Feb 21, 2026 at 01:54:17PM -0500, Waiman Long wrote:
> The cpuset_handle_hotplug() may need to invoke housekeeping_update(),
> for instance, when an isolated partition is invalidated because its
> last active CPU has been put offline.
> 
> As we are going to enable dynamic update to the nohz_full housekeeping
> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
> allowing the CPU hotplug path to call into housekeeping_update() directly
> from update_isolation_cpumasks() will likely cause deadlock.

It would be nice to describe the deadlock scenario here.

Thanks.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 8/8] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
  2026-02-21 18:54 ` [PATCH v6 8/8] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock Waiman Long
@ 2026-03-02 12:14   ` Frederic Weisbecker
  2026-03-02 14:15     ` Waiman Long
  0 siblings, 1 reply; 29+ messages in thread
From: Frederic Weisbecker @ 2026-03-02 12:14 UTC (permalink / raw)
  To: Waiman Long
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Thomas Gleixner, Shuah Khan, cgroups, linux-kernel,
	linux-kselftest

On Sat, Feb 21, 2026 at 01:54:18PM -0500, Waiman Long wrote:
> The current cpuset partition code is able to dynamically update
> the sched domains of a running system and the corresponding
> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentally the
> "isolcpus=domain,..." boot command line feature at run time.
> 
> The housekeeping cpumask update requires flushing a number of different
> workqueues which may not be safe with cpus_read_lock() held as the
> workqueue flushing code may acquire cpus_read_lock() or acquiring locks
> which have locking dependency with cpus_read_lock() down the chain. Below
> is an example of such circular locking problem.
> 
>   ======================================================
>   WARNING: possible circular locking dependency detected
>   6.18.0-test+ #2 Tainted: G S
>   ------------------------------------------------------
>   test_cpuset_prs/10971 is trying to acquire lock:
>   ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180
> 
>   but task is already holding lock:
>   ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
> 
>   which lock already depends on the new lock.
> 
>   the existing dependency chain (in reverse order) is:
>   -> #4 (cpuset_mutex){+.+.}-{4:4}:
>   -> #3 (cpu_hotplug_lock){++++}-{0:0}:
>   -> #2 (rtnl_mutex){+.+.}-{4:4}:
>   -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
>   -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
> 
>   Chain exists of:
>     (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex

Which workqueue is involved here that holds rtnl_mutex?
Is this an existing problem or added test code?

Thanks.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 0/8] cgroup/cpuset: Fix partition related locking issues
  2026-02-23 20:57 ` [PATCH v6 0/8] cgroup/cpuset: Fix partition related locking issues Tejun Heo
  2026-02-23 21:11   ` Waiman Long
@ 2026-03-02 12:21   ` Frederic Weisbecker
  1 sibling, 0 replies; 29+ messages in thread
From: Frederic Weisbecker @ 2026-03-02 12:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Waiman Long, Chen Ridong, Johannes Weiner, Michal Koutny,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Thomas Gleixner, Shuah Khan, cgroups, linux-kernel,
	linux-kselftest

On Mon, Feb 23, 2026 at 10:57:24AM -1000, Tejun Heo wrote:
> Hello,
> 
> > Waiman Long (8):
> >   cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del()
> >   cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in update_cpumasks_hier()
> >   cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
> >   cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed
> >   kselftest/cgroup: Simplify test_cpuset_prs.sh by removing "S+" command
> >   cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together
> >   cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
> >   cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
> 
> Applied 1-8 to cgroup/for-7.0-fixes with the following minor fixups:
> 
> - 5/8: Removed a duplicate test entry that resulted from the "S+"
>   removal (two previously-different lines becoming identical).
> 
> - 8/8: Fixed typos in commit message ("essentally" -> "essentially",
>   "beforce" -> "before") and code comment ("top_cpuset_mutex" ->
>   "cpuset_top_mutex").
> 
> This has gone through more than enough iterations. We can resolve further
> issues if there's any incrementally.

We really need to check the fact that the workqueue is not flushed at any
relevant point in hotplug such that:

- offline CPU might now appear in the live topology, quite dangerous.

- CPUs might not be timely (un)isolated when they are expected to.

Thanks.

> 
> Thanks.
> 
> --
> tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 8/8] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
  2026-03-02 12:14   ` Frederic Weisbecker
@ 2026-03-02 14:15     ` Waiman Long
  2026-03-02 15:40       ` Waiman Long
  0 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2026-03-02 14:15 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Thomas Gleixner, Shuah Khan, cgroups, linux-kernel,
	linux-kselftest

On 3/2/26 7:14 AM, Frederic Weisbecker wrote:
> On Sat, Feb 21, 2026 at 01:54:18PM -0500, Waiman Long wrote:
>> The current cpuset partition code is able to dynamically update
>> the sched domains of a running system and the corresponding
>> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentally the
>> "isolcpus=domain,..." boot command line feature at run time.
>>
>> The housekeeping cpumask update requires flushing a number of different
>> workqueues which may not be safe with cpus_read_lock() held as the
>> workqueue flushing code may acquire cpus_read_lock() or acquiring locks
>> which have locking dependency with cpus_read_lock() down the chain. Below
>> is an example of such circular locking problem.
>>
>>    ======================================================
>>    WARNING: possible circular locking dependency detected
>>    6.18.0-test+ #2 Tainted: G S
>>    ------------------------------------------------------
>>    test_cpuset_prs/10971 is trying to acquire lock:
>>    ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180
>>
>>    but task is already holding lock:
>>    ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
>>
>>    which lock already depends on the new lock.
>>
>>    the existing dependency chain (in reverse order) is:
>>    -> #4 (cpuset_mutex){+.+.}-{4:4}:
>>    -> #3 (cpu_hotplug_lock){++++}-{0:0}:
>>    -> #2 (rtnl_mutex){+.+.}-{4:4}:
>>    -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
>>    -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
>>
>>    Chain exists of:
>>      (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex
> Which workqueue is involved here that holds rtnl_mutex?
> Is this an existing problem or added test code?

A circular locking dependency here does not necessarily mean that
rtnl_mutex is directly used in a work function. However, it can be part
of a locking chain involving multiple parties that results in a
deadlock if the acquisitions happen in the right order. So it is better
to be safe than sorry even if the chance of this occurrence is minimal.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 8/8] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
  2026-03-02 14:15     ` Waiman Long
@ 2026-03-02 15:40       ` Waiman Long
  0 siblings, 0 replies; 29+ messages in thread
From: Waiman Long @ 2026-03-02 15:40 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Thomas Gleixner, Shuah Khan, cgroups, linux-kernel,
	linux-kselftest

On 3/2/26 9:15 AM, Waiman Long wrote:
> On 3/2/26 7:14 AM, Frederic Weisbecker wrote:
>> On Sat, Feb 21, 2026 at 01:54:18PM -0500, Waiman Long wrote:
>>> The current cpuset partition code is able to dynamically update
>>> the sched domains of a running system and the corresponding
>>> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentally the
>>> "isolcpus=domain,..." boot command line feature at run time.
>>>
>>> The housekeeping cpumask update requires flushing a number of different
>>> workqueues which may not be safe with cpus_read_lock() held as the
>>> workqueue flushing code may acquire cpus_read_lock() or acquiring locks
>>> which have locking dependency with cpus_read_lock() down the chain. 
>>> Below
>>> is an example of such circular locking problem.
>>>
>>>    ======================================================
>>>    WARNING: possible circular locking dependency detected
>>>    6.18.0-test+ #2 Tainted: G S
>>>    ------------------------------------------------------
>>>    test_cpuset_prs/10971 is trying to acquire lock:
>>>    ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: 
>>> touch_wq_lockdep_map+0x7a/0x180
>>>
>>>    but task is already holding lock:
>>>    ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: 
>>> cpuset_partition_write+0x85/0x130
>>>
>>>    which lock already depends on the new lock.
>>>
>>>    the existing dependency chain (in reverse order) is:
>>>    -> #4 (cpuset_mutex){+.+.}-{4:4}:
>>>    -> #3 (cpu_hotplug_lock){++++}-{0:0}:
>>>    -> #2 (rtnl_mutex){+.+.}-{4:4}:
>>>    -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
>>>    -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
>>>
>>>    Chain exists of:
>>>      (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex
>> Which workqueue is involved here that holds rtnl_mutex?
>> Is this an existing problem or added test code?
>
> A circular locking dependency here does not necessarily mean that
> rtnl_mutex is directly used in a work function. However, it can be part
> of a locking chain involving multiple parties that results in a
> deadlock if the acquisitions happen in the right order. So it is better
> to be safe than sorry even if the chance of this occurrence is minimal.

Below is the full lockdep splat, I didn't include the individual stack 
traces to make the commit log less verbose.

The rtnl_mutex is indeed involved in local_pci_probe().

Cheers,
Longman

[  909.360022] ======================================================
[  909.366208] WARNING: possible circular locking dependency detected
[  909.372387] 7.0.0-rc1-test+ #3 Tainted: G S
[  909.378044] ------------------------------------------------------
[  909.384225] test_cpuset_prs/8673 is trying to acquire lock:
[  909.389798] ffff8890b0fd6558 ((wq_completion)sync_wq){+.+.}-{0:0}, 
at: touch_wq_lockdep_map+0x7a/0x180
[  909.399114]
                but task is already holding lock:
[  909.404946] ffffffffb9741c10 (cpuset_mutex){+.+.}-{4:4}, at: 
cpuset_partition_write+0x85/0x130
[  909.413562]
                which lock already depends on the new lock.

[  909.421733]
                the existing dependency chain (in reverse order) is:
[  909.429213]
                -> #4 (cpuset_mutex){+.+.}-{4:4}:
[  909.435056]        __lock_acquire+0x58c/0xbd0
[  909.439421]        lock_acquire.part.0+0xbd/0x260
[  909.444129]        __mutex_lock+0x1a7/0x1ba0
[  909.448411]        cpuset_css_online+0x59/0x410
[  909.452948]        online_css+0x9b/0x2d0
[  909.456877]        css_create+0x3c6/0x610
[  909.460895]        cgroup_apply_control_enable+0x2ff/0x460
[  909.466384]        cgroup_subtree_control_write+0x79a/0xc70
[  909.471963]        cgroup_file_write+0x1a5/0x680
[  909.476582]        kernfs_fop_write_iter+0x3df/0x5f0
[  909.481550]        vfs_write+0x525/0xfd0
[  909.485482]        ksys_write+0xf9/0x1d0
[  909.489410]        do_syscall_64+0x13a/0x1520
[  909.493778]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  909.499361]
                -> #3 (cpu_hotplug_lock){++++}-{0:0}:
[  909.505547]        __lock_acquire+0x58c/0xbd0
[  909.509914]        lock_acquire.part.0+0xbd/0x260
[  909.514630]        cpus_read_lock+0x40/0xe0
[  909.518824]        flush_all_backlogs+0x83/0x4b0
[  909.523451] unregister_netdevice_many_notify+0x7e8/0x1fa0
[  909.529465]        default_device_exit_batch+0x356/0x490
[  909.534788]        ops_undo_list+0x2f4/0x930
[  909.539067]        cleanup_net+0x40a/0x8f0
[  909.543168]        process_one_work+0xd8b/0x1320
[  909.547795]        worker_thread+0x5f3/0xfe0
[  909.552068]        kthread+0x36c/0x470
[  909.555830]        ret_from_fork+0x5dc/0x8e0
[  909.560109]        ret_from_fork_asm+0x1a/0x30
[  909.564557]
                -> #2 (rtnl_mutex){+.+.}-{4:4}:
[  909.570224]        __lock_acquire+0x58c/0xbd0
[  909.574592]        lock_acquire.part.0+0xbd/0x260
[  909.579304]        __mutex_lock+0x1a7/0x1ba0
[  909.583580]        rtnl_net_lock_killable+0x1e/0x70
[  909.588465]        register_netdev+0x40/0x70
[  909.592738]        i40e_vsi_setup+0x892/0x14b0 [i40e]
[  909.597854]        i40e_setup_pf_switch+0xaa1/0xe80 [i40e]
[  909.603392]        i40e_probe.cold+0xdb0/0x1d1b [i40e]
[  909.608582]        local_pci_probe+0xdb/0x180
[  909.612951]        local_pci_probe_callback+0x35/0x80
[  909.618008]        process_one_work+0xd8b/0x1320
[  909.622631]        worker_thread+0x5f3/0xfe0
[  909.626912]        kthread+0x36c/0x470
[  909.630673]        ret_from_fork+0x5dc/0x8e0
[  909.634951]        ret_from_fork_asm+0x1a/0x30
[  909.639399]
                -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
[  909.646627]        __lock_acquire+0x58c/0xbd0
[  909.650994]        lock_acquire.part.0+0xbd/0x260
[  909.655699]        process_one_work+0xd58/0x1320
[  909.660321]        worker_thread+0x5f3/0xfe0
[  909.664602]        kthread+0x36c/0x470
[  909.668363]        ret_from_fork+0x5dc/0x8e0
[  909.672641]        ret_from_fork_asm+0x1a/0x30
[  909.677089]
                -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
[  909.683795]        check_prev_add+0xf1/0xc80
[  909.688068]        validate_chain+0x481/0x560
[  909.692431]        __lock_acquire+0x58c/0xbd0
[  909.696797]        lock_acquire.part.0+0xbd/0x260
[  909.701511]        touch_wq_lockdep_map+0x93/0x180
[  909.706314]        __flush_workqueue+0x111/0x10b0
[  909.711026]        housekeeping_update+0x12d/0x2d0
[  909.715819]        update_parent_effective_cpumask+0x595/0x2440
[  909.721747]        update_prstate+0x89d/0xce0
[  909.726105]        cpuset_partition_write+0xc5/0x130
[  909.731073]        cgroup_file_write+0x1a5/0x680
[  909.735701]        kernfs_fop_write_iter+0x3df/0x5f0
[  909.740664]        vfs_write+0x525/0xfd0
[  909.744592]        ksys_write+0xf9/0x1d0
[  909.748520]        do_syscall_64+0x13a/0x1520
[  909.752887]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  909.758465]
                other info that might help us debug this:

[  909.766466] Chain exists of:
                  (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex

[  909.777679]  Possible unsafe locking scenario:

[  909.783599]        CPU0                    CPU1
[  909.788130]        ----                    ----
[  909.792666]   lock(cpuset_mutex);
[  909.795991]                                lock(cpu_hotplug_lock);
[  909.802171]                                lock(cpuset_mutex);
[  909.808013]   lock((wq_completion)sync_wq);
[  909.812207]
                 *** DEADLOCK ***

[  909.818127] 5 locks held by test_cpuset_prs/8673:
[  909.822830]  #0: ffff888140592440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
[  909.830839]  #1: ffff889100a49890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
[  909.839890]  #2: ffff8890fbfa5368 (kn->active#353){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
[  909.849118]  #3: ffffffffb9134d00 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
[  909.858522]  #4: ffffffffb9741c10 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
[  909.867576]
                stack backtrace:
[  909.871940] CPU: 95 UID: 0 PID: 8673 Comm: test_cpuset_prs Kdump: loaded Tainted: G S                  7.0.0-rc1-test+ #3 PREEMPT(full)
[  909.871946] Tainted: [S]=CPU_OUT_OF_SPEC
[  909.871948] Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.0X.02.0001.043020191705 04/30/2019
[  909.871950] Call Trace:
[  909.871952]  <TASK>
[  909.871955]  dump_stack_lvl+0x6f/0xb0
[  909.871961]  print_circular_bug.cold+0x38/0x45
[  909.871968]  check_noncircular+0x146/0x160
[  909.871975]  check_prev_add+0xf1/0xc80
[  909.871978]  ? alloc_chain_hlocks+0x13e/0x1d0
[  909.871982]  ? add_chain_cache+0x11c/0x300
[  909.871986]  validate_chain+0x481/0x560
[  909.871991]  __lock_acquire+0x58c/0xbd0
[  909.871995]  ? lockdep_init_map_type+0x66/0x250
[  909.872000]  lock_acquire.part.0+0xbd/0x260
[  909.872004]  ? touch_wq_lockdep_map+0x7a/0x180
[  909.872009]  ? rcu_is_watching+0x15/0xb0
[  909.872013]  ? trace_rcu_sr_normal+0x1d5/0x2e0
[  909.872018]  ? touch_wq_lockdep_map+0x7a/0x180
[  909.872021]  ? lock_acquire+0x159/0x180
[  909.872026]  ? touch_wq_lockdep_map+0x7a/0x180
[  909.872030]  touch_wq_lockdep_map+0x93/0x180
[  909.872034]  ? touch_wq_lockdep_map+0x7a/0x180
[  909.872038]  __flush_workqueue+0x111/0x10b0
[  909.872042]  ? local_clock_noinstr+0xd/0xe0
[  909.872049]  ? __pfx___flush_workqueue+0x10/0x10
[  909.872059]  housekeeping_update+0x12d/0x2d0
[  909.872063]  update_parent_effective_cpumask+0x595/0x2440
[  909.872070]  update_prstate+0x89d/0xce0
[  909.872076]  ? __pfx_update_prstate+0x10/0x10
[  909.872085]  cpuset_partition_write+0xc5/0x130
[  909.872089]  cgroup_file_write+0x1a5/0x680
[  909.872093]  ? __pfx_cgroup_file_write+0x10/0x10
[  909.872097]  ? kernfs_fop_write_iter+0x2b6/0x5f0
[  909.872102]  ? __pfx_cgroup_file_write+0x10/0x10
[  909.872105]  kernfs_fop_write_iter+0x3df/0x5f0
[  909.872109]  vfs_write+0x525/0xfd0
[  909.872113]  ? __pfx_vfs_write+0x10/0x10
[  909.872118]  ? __lock_acquire+0x58c/0xbd0
[  909.872124]  ? find_held_lock+0x32/0x90
[  909.872130]  ksys_write+0xf9/0x1d0
[  909.872133]  ? __pfx_ksys_write+0x10/0x10
[  909.872136]  ? lockdep_hardirqs_on+0x78/0x100
[  909.872141]  ? do_syscall_64+0xde/0x1520
[  909.872146]  do_syscall_64+0x13a/0x1520
[  909.872151]  ? rcu_is_watching+0x15/0xb0
[  909.872154]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  909.872157]  ? lockdep_hardirqs_on+0x78/0x100
[  909.872161]  ? do_syscall_64+0x212/0x1520
[  909.872166]  ? find_held_lock+0x32/0x90
[  909.872170]  ? local_clock_noinstr+0xd/0xe0
[  909.872174]  ? __lock_release.isra.0+0x1a2/0x2c0
[  909.872178]  ? exc_page_fault+0x78/0xf0
[  909.872183]  ? rcu_is_watching+0x15/0xb0
[  909.872186]  ? trace_irq_enable.constprop.0+0x194/0x200
[  909.872191]  ? lockdep_hardirqs_on_prepare.part.0+0x8e/0x170
[  909.872196]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  909.872199] RIP: 0033:0x7f877d3e9544
[  909.872203] Code: 89 02 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa 80 3d a5 cb 0d 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 48 83 ec 28 48 89 54 24 18 48
[  909.872206] RSP: 002b:00007ffd6ff21b28 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[  909.872210] RAX: ffffffffffffffda RBX: 00007f877d4bf5c0 RCX: 00007f877d3e9544
[  909.872213] RDX: 0000000000000009 RSI: 0000557ff7ec2320 RDI: 0000000000000001
[  909.872215] RBP: 0000000000000009 R08: 0000000000000073 R09: 00000000ffffffff
[  909.872217] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000009
[  909.872219] R13: 0000557ff7ec2320 R14: 0000000000000009 R15: 00007f877d4bcf00
[  909.872226]  </TASK>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 7/8] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
  2026-02-21 18:54 ` [PATCH v6 7/8] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue Waiman Long
  2026-02-26 16:06   ` Frederic Weisbecker
  2026-03-02 11:49   ` Frederic Weisbecker
@ 2026-03-03 15:18   ` Jon Hunter
  2026-03-03 16:09     ` Waiman Long
  2026-03-04  3:58     ` Waiman Long
  2 siblings, 2 replies; 29+ messages in thread
From: Jon Hunter @ 2026-03-03 15:18 UTC (permalink / raw)
  To: Waiman Long, Chen Ridong, Tejun Heo, Johannes Weiner,
	Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest,
	linux-tegra@vger.kernel.org

Hi Waiman,

On 21/02/2026 18:54, Waiman Long wrote:
> The cpuset_handle_hotplug() may need to invoke housekeeping_update(),
> for instance, when an isolated partition is invalidated because its
> last active CPU has been put offline.
> 
> As we are going to enable dynamic update to the nohz_full housekeeping
> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
> allowing the CPU hotplug path to call into housekeeping_update() directly
> from update_isolation_cpumasks() will likely cause deadlock. So we
> have to defer any call to housekeeping_update() until after the CPU hotplug
> operation has finished. This is now done via the workqueue where
> the update_hk_sched_domains() function will be invoked via the
> hk_sd_workfn().
> 
> A concurrent cpuset control file write may have executed the required
> update_hk_sched_domains() function before the work function is called. So
> the work function call may become a no-op when it is invoked.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>   kernel/cgroup/cpuset.c                        | 31 ++++++++++++++++---
>   .../selftests/cgroup/test_cpuset_prs.sh       | 11 ++++++-
>   2 files changed, 36 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 3d0d18bf182f..2c80bfc30bbc 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1323,6 +1323,16 @@ static void update_hk_sched_domains(void)
>   		rebuild_sched_domains_locked();
>   }
>   
> +/*
> + * Work function to invoke update_hk_sched_domains()
> + */
> +static void hk_sd_workfn(struct work_struct *work)
> +{
> +	cpuset_full_lock();
> +	update_hk_sched_domains();
> +	cpuset_full_unlock();
> +}
> +
>   /**
>    * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
>    * @parent: Parent cpuset containing all siblings
> @@ -3795,6 +3805,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
>    */
>   static void cpuset_handle_hotplug(void)
>   {
> +	static DECLARE_WORK(hk_sd_work, hk_sd_workfn);
>   	static cpumask_t new_cpus;
>   	static nodemask_t new_mems;
>   	bool cpus_updated, mems_updated;
> @@ -3877,11 +3888,21 @@ static void cpuset_handle_hotplug(void)
>   	}
>   
>   
> -	if (update_housekeeping || force_sd_rebuild) {
> -		mutex_lock(&cpuset_mutex);
> -		update_hk_sched_domains();
> -		mutex_unlock(&cpuset_mutex);
> -	}
> +	/*
> +	 * Queue a work to call housekeeping_update() & rebuild_sched_domains()
> +	 * There will be a slight delay before the HK_TYPE_DOMAIN housekeeping
> +	 * cpumask can correctly reflect what is in isolated_cpus.
> +	 *
> +	 * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work item that
> +	 * is still pending. Before the pending bit is cleared, the work data
> +	 * is copied out and work item dequeued. So it is possible to queue
> +	 * the work again before the hk_sd_workfn() is invoked to process the
> +	 * previously queued work. Since hk_sd_workfn() doesn't use the work
> +	 * item at all, this is not a problem.
> +	 */
> +	if (update_housekeeping || force_sd_rebuild)
> +		queue_work(system_unbound_wq, &hk_sd_work);
> +
>   	free_tmpmasks(ptmp);
>   }
>   
> diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> index 0c5db118f2d1..dc2dff361ec6 100755
> --- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> @@ -246,6 +246,9 @@ TEST_MATRIX=(
>   	"  C2-3:P1  C3:P1   .      .     O3=0    .      .      .     0 A1:2|A2: A1:P1|A2:P1"
>   	"  C2-3:P1  C3:P1   .      .    T:O2=0   .      .      .     0 A1:3|A2:3 A1:P1|A2:P-1"
>   	"  C2-3:P1  C3:P1   .      .      .    T:O3=0   .      .     0 A1:2|A2:2 A1:P1|A2:P-1"
> +	"  C2-3:P1  C3:P2   .      .    T:O2=0   .      .      .     0 A1:3|A2:3 A1:P1|A2:P-2"
> +	"  C1-3:P1  C3:P2   .      .      .    T:O3=0   .      .     0 A1:1-2|A2:1-2 A1:P1|A2:P-2 3|"
> +	"  C1-3:P1  C3:P2   .      .      .    T:O3=0  O3=1    .     0 A1:1-2|A2:3 A1:P1|A2:P2  3"
>   	"$SETUP_A123_PARTITIONS    .     O1=0    .      .      .     0 A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
>   	"$SETUP_A123_PARTITIONS    .     O2=0    .      .      .     0 A1:1|A2:|A3:3 A1:P1|A2:P1|A3:P1"
>   	"$SETUP_A123_PARTITIONS    .     O3=0    .      .      .     0 A1:1|A2:2|A3: A1:P1|A2:P1|A3:P1"
> @@ -762,7 +765,7 @@ check_cgroup_states()
>   # only CPUs in isolated partitions as well as those that are isolated at
>   # boot time.
>   #
> -# $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
> +# $1 - expected isolated cpu list(s) <isolcpus1>{|<isolcpus2>}
>   # <isolcpus1> - expected sched/domains value
>   # <isolcpus2> - cpuset.cpus.isolated value = <isolcpus1> if not defined
>   #
> @@ -771,6 +774,7 @@ check_isolcpus()
>   	EXPECTED_ISOLCPUS=$1
>   	ISCPUS=${CGROUP2}/cpuset.cpus.isolated
>   	ISOLCPUS=$(cat $ISCPUS)
> +	HKICPUS=$(cat /sys/devices/system/cpu/isolated)
>   	LASTISOLCPU=
>   	SCHED_DOMAINS=/sys/kernel/debug/sched/domains
>   	if [[ $EXPECTED_ISOLCPUS = . ]]
> @@ -808,6 +812,11 @@ check_isolcpus()
>   	ISOLCPUS=
>   	EXPECTED_ISOLCPUS=$EXPECTED_SDOMAIN
>   
> +	#
> +	# The inverse of HK_TYPE_DOMAIN cpumask in $HKICPUS should match $ISOLCPUS
> +	#
> +	[[ "$ISOLCPUS" != "$HKICPUS" ]] && return 1
> +
>   	#
>   	# Use the sched domain in debugfs to check isolated CPUs, if available
>   	#

We have a CPU hotplug test that cycles through all CPUs, off-lining and
on-lining them in different combinations. Since this change was added to
-next, this test has been failing on our Tegra210 boards. A bisect pointed
to this commit, and reverting it on top of -next fixes the issue.

The test is quite simple and part of Thierry's tegra-test suite [0].

$ ./tegra-tests/tests/cpu.py --verbose hotplug
cpu: hotplug: CPU#0: mask: 1
cpu: hotplug: CPU#1: mask: 2
cpu: hotplug: CPU#2: mask: 4
cpu: hotplug: CPU#3: mask: 8
cpu: hotplug: applying mask 0xf
cpu: hotplug: applying mask 0xe
cpu: hotplug: applying mask 0xd
cpu: hotplug: applying mask 0xc
cpu: hotplug: applying mask 0xb
cpu: hotplug: applying mask 0xa
...
cpu: hotplug: applying mask 0x1
Traceback (most recent call last):
   File "./tegra-tests/tests/cpu.py", line 159, in <module>
     runner.standalone(module)
   File "./tegra-tests/tests/runner.py", line 147, in standalone
     log.test(log = log, args = args)
   File "./tegra-tests/tests/cpu.py", line 29, in __call__
     cpus.apply_mask(mask)
   File "./tegra-tests/linux/system.py", line 149, in apply_mask
     cpu.set_online(False)
   File "./tegra-tests/linux/system.py", line 45, in set_online
     self.online = online
OSError: [Errno 16] Device or resource busy

From looking at different runs, it appears to fail at different places.

Let me know if you have any thoughts.

Thanks
Jon

[0] https://github.com/thierryreding/tegra-tests/blob/master/tests/cpu.py

-- 
nvpublic


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 7/8] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
  2026-02-26 16:06   ` Frederic Weisbecker
@ 2026-03-03 16:00     ` Waiman Long
  2026-03-03 22:48       ` Frederic Weisbecker
  0 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2026-03-03 16:00 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Thomas Gleixner, Shuah Khan, cgroups, linux-kernel,
	linux-kselftest

On 2/26/26 11:06 AM, Frederic Weisbecker wrote:
> Le Sat, Feb 21, 2026 at 01:54:17PM -0500, Waiman Long a écrit :
>> The cpuset_handle_hotplug() may need to invoke housekeeping_update(),
>> for instance, when an isolated partition is invalidated because its
>> last active CPU has been put offline.
>>
>> As we are going to enable dynamic update to the nohz_full housekeeping
>> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
>> allowing the CPU hotplug path to call into housekeeping_update() directly
>> from update_isolation_cpumasks() will likely cause deadlock.
> I am a bit confused here. Why would CPU hotplug path need to call
> update_isolation_cpumasks() -> housekeeping_update() for
> HK_TYPE_KERNEL_NOISE?

Oh, this is not the current behavior. However, to make nohz_full fully 
dynamically changeable in the near future, we will have to do that 
eventually.

Cheers,
Longman


>> So we
>> have to defer any call to housekeeping_update() until after the CPU hotplug
>> operation has finished. This is now done via the workqueue where
>> the update_hk_sched_domains() function will be invoked via the
>> hk_sd_workfn().
>>
>> A concurrent cpuset control file write may have executed the required
>> update_hk_sched_domains() function before the work function is called. So
>> the work function call may become a no-op when it is invoked.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   kernel/cgroup/cpuset.c                        | 31 ++++++++++++++++---
>>   .../selftests/cgroup/test_cpuset_prs.sh       | 11 ++++++-
>>   2 files changed, 36 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 3d0d18bf182f..2c80bfc30bbc 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -1323,6 +1323,16 @@ static void update_hk_sched_domains(void)
>>   		rebuild_sched_domains_locked();
>>   }
>>   
>> +/*
>> + * Work function to invoke update_hk_sched_domains()
>> + */
>> +static void hk_sd_workfn(struct work_struct *work)
>> +{
>> +	cpuset_full_lock();
>> +	update_hk_sched_domains();
>> +	cpuset_full_unlock();
>> +}
>> +
>>   /**
>>    * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
>>    * @parent: Parent cpuset containing all siblings
>> @@ -3795,6 +3805,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
>>    */
>>   static void cpuset_handle_hotplug(void)
>>   {
>> +	static DECLARE_WORK(hk_sd_work, hk_sd_workfn);
>>   	static cpumask_t new_cpus;
>>   	static nodemask_t new_mems;
>>   	bool cpus_updated, mems_updated;
>> @@ -3877,11 +3888,21 @@ static void cpuset_handle_hotplug(void)
>>   	}
>>   
>>   
>> -	if (update_housekeeping || force_sd_rebuild) {
>> -		mutex_lock(&cpuset_mutex);
>> -		update_hk_sched_domains();
>> -		mutex_unlock(&cpuset_mutex);
>> -	}
>> +	/*
>> +	 * Queue a work to call housekeeping_update() & rebuild_sched_domains()
>> +	 * There will be a slight delay before the HK_TYPE_DOMAIN housekeeping
>> +	 * cpumask can correctly reflect what is in isolated_cpus.
>> +	 *
>> +	 * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work item that
>> +	 * is still pending. Before the pending bit is cleared, the work data
>> +	 * is copied out and work item dequeued. So it is possible to queue
>> +	 * the work again before the hk_sd_workfn() is invoked to process the
>> +	 * previously queued work. Since hk_sd_workfn() doesn't use the work
>> +	 * item at all, this is not a problem.
>> +	 */
>> +	if (update_housekeeping || force_sd_rebuild)
>> +		queue_work(system_unbound_wq, &hk_sd_work);
> Nit about recent wq renames:
>
> s/system_unbound_wq/system_dfl_wq
Good point. Will send additional patch to do the rename.
>
> But what makes sure this work is executed by the end of the hotplug operations?
> Is there a risk for a stale hierarchy to be observed when it shouldn't? Or a
> stale housekeeping cpumask?

If you look at the work function, it will make a copy of HK_TYPE_DOMAIN 
cpumask while holding rcu_read_lock(). So the current hotplug operation 
must have finished at that point. Of course, if there is another 
hot-add/remove operation right after the rcu_read_lock is released, the 
cpumask passed down to housekeeping_update() may not be the latest one. 
In this case, another work will be scheduled to call 
housekeeping_update() with the new cpumask again.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 7/8] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
  2026-03-03 15:18   ` Jon Hunter
@ 2026-03-03 16:09     ` Waiman Long
  2026-03-04  3:58     ` Waiman Long
  1 sibling, 0 replies; 29+ messages in thread
From: Waiman Long @ 2026-03-03 16:09 UTC (permalink / raw)
  To: Jon Hunter, Chen Ridong, Tejun Heo, Johannes Weiner,
	Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest,
	linux-tegra@vger.kernel.org

On 3/3/26 10:18 AM, Jon Hunter wrote:
> Hi Waiman,
>
> On 21/02/2026 18:54, Waiman Long wrote:
>> The cpuset_handle_hotplug() may need to invoke housekeeping_update(),
>> for instance, when an isolated partition is invalidated because its
>> last active CPU has been put offline.
>>
>> As we are going to enable dynamic update to the nohz_full housekeeping
>> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
>> allowing the CPU hotplug path to call into housekeeping_update() 
>> directly
>> from update_isolation_cpumasks() will likely cause deadlock. So we
>> have to defer any call to housekeeping_update() until after the CPU hotplug
>> operation has finished. This is now done via the workqueue where
>> the update_hk_sched_domains() function will be invoked via the
>> hk_sd_workfn().
>>
>> A concurrent cpuset control file write may have executed the required
>> update_hk_sched_domains() function before the work function is 
>> called. So
>> the work function call may become a no-op when it is invoked.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   kernel/cgroup/cpuset.c                        | 31 ++++++++++++++++---
>>   .../selftests/cgroup/test_cpuset_prs.sh       | 11 ++++++-
>>   2 files changed, 36 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 3d0d18bf182f..2c80bfc30bbc 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -1323,6 +1323,16 @@ static void update_hk_sched_domains(void)
>>           rebuild_sched_domains_locked();
>>   }
>>   +/*
>> + * Work function to invoke update_hk_sched_domains()
>> + */
>> +static void hk_sd_workfn(struct work_struct *work)
>> +{
>> +    cpuset_full_lock();
>> +    update_hk_sched_domains();
>> +    cpuset_full_unlock();
>> +}
>> +
>>   /**
>>    * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by 
>> sibling cpusets
>>    * @parent: Parent cpuset containing all siblings
>> @@ -3795,6 +3805,7 @@ static void cpuset_hotplug_update_tasks(struct 
>> cpuset *cs, struct tmpmasks *tmp)
>>    */
>>   static void cpuset_handle_hotplug(void)
>>   {
>> +    static DECLARE_WORK(hk_sd_work, hk_sd_workfn);
>>       static cpumask_t new_cpus;
>>       static nodemask_t new_mems;
>>       bool cpus_updated, mems_updated;
>> @@ -3877,11 +3888,21 @@ static void cpuset_handle_hotplug(void)
>>       }
>>     -    if (update_housekeeping || force_sd_rebuild) {
>> -        mutex_lock(&cpuset_mutex);
>> -        update_hk_sched_domains();
>> -        mutex_unlock(&cpuset_mutex);
>> -    }
>> +    /*
>> +     * Queue a work to call housekeeping_update() & 
>> rebuild_sched_domains()
>> +     * There will be a slight delay before the HK_TYPE_DOMAIN 
>> housekeeping
>> +     * cpumask can correctly reflect what is in isolated_cpus.
>> +     *
>> +     * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work item 
>> that
>> +     * is still pending. Before the pending bit is cleared, the work 
>> data
>> +     * is copied out and work item dequeued. So it is possible to queue
>> +     * the work again before the hk_sd_workfn() is invoked to 
>> process the
>> +     * previously queued work. Since hk_sd_workfn() doesn't use the 
>> work
>> +     * item at all, this is not a problem.
>> +     */
>> +    if (update_housekeeping || force_sd_rebuild)
>> +        queue_work(system_unbound_wq, &hk_sd_work);
>> +
>>       free_tmpmasks(ptmp);
>>   }
>>   diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh 
>> b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
>> index 0c5db118f2d1..dc2dff361ec6 100755
>> --- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
>> +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
>> @@ -246,6 +246,9 @@ TEST_MATRIX=(
>>       "  C2-3:P1  C3:P1   .      .     O3=0    .      . .     0 
>> A1:2|A2: A1:P1|A2:P1"
>>       "  C2-3:P1  C3:P1   .      .    T:O2=0   .      . .     0 
>> A1:3|A2:3 A1:P1|A2:P-1"
>>       "  C2-3:P1  C3:P1   .      .      .    T:O3=0   . .     0 
>> A1:2|A2:2 A1:P1|A2:P-1"
>> +    "  C2-3:P1  C3:P2   .      .    T:O2=0   .      . .     0 
>> A1:3|A2:3 A1:P1|A2:P-2"
>> +    "  C1-3:P1  C3:P2   .      .      .    T:O3=0   . .     0 
>> A1:1-2|A2:1-2 A1:P1|A2:P-2 3|"
>> +    "  C1-3:P1  C3:P2   .      .      .    T:O3=0  O3=1 .     0 
>> A1:1-2|A2:3 A1:P1|A2:P2  3"
>>       "$SETUP_A123_PARTITIONS    .     O1=0    .      . .     0 
>> A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
>>       "$SETUP_A123_PARTITIONS    .     O2=0    .      . .     0 
>> A1:1|A2:|A3:3 A1:P1|A2:P1|A3:P1"
>>       "$SETUP_A123_PARTITIONS    .     O3=0    .      . .     0 
>> A1:1|A2:2|A3: A1:P1|A2:P1|A3:P1"
>> @@ -762,7 +765,7 @@ check_cgroup_states()
>>   # only CPUs in isolated partitions as well as those that are 
>> isolated at
>>   # boot time.
>>   #
>> -# $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
>> +# $1 - expected isolated cpu list(s) <isolcpus1>{|<isolcpus2>}
>>   # <isolcpus1> - expected sched/domains value
>>   # <isolcpus2> - cpuset.cpus.isolated value = <isolcpus1> if not 
>> defined
>>   #
>> @@ -771,6 +774,7 @@ check_isolcpus()
>>       EXPECTED_ISOLCPUS=$1
>>       ISCPUS=${CGROUP2}/cpuset.cpus.isolated
>>       ISOLCPUS=$(cat $ISCPUS)
>> +    HKICPUS=$(cat /sys/devices/system/cpu/isolated)
>>       LASTISOLCPU=
>>       SCHED_DOMAINS=/sys/kernel/debug/sched/domains
>>       if [[ $EXPECTED_ISOLCPUS = . ]]
>> @@ -808,6 +812,11 @@ check_isolcpus()
>>       ISOLCPUS=
>>       EXPECTED_ISOLCPUS=$EXPECTED_SDOMAIN
>>   +    #
>> +    # The inverse of HK_TYPE_DOMAIN cpumask in $HKICPUS should match 
>> $ISOLCPUS
>> +    #
>> +    [[ "$ISOLCPUS" != "$HKICPUS" ]] && return 1
>> +
>>       #
>>       # Use the sched domain in debugfs to check isolated CPUs, if 
>> available
>>       #
>
> We have a CPU hotplug test that cycles through all CPUs off-lining 
> them and on-lining them in different combinations. Since this change 
> was added to -next, this test is failing on our Tegra210 boards. 
> Bisecting the issue, it pointed to this commit and reverting this on 
> top of -next fixes the issue.
>
> The test is quite simple and part of Thierry's tegra-test suite [0].
>
> $ ./tegra-tests/tests/cpu.py --verbose hotplug
> cpu: hotplug: CPU#0: mask: 1
> cpu: hotplug: CPU#1: mask: 2
> cpu: hotplug: CPU#2: mask: 4
> cpu: hotplug: CPU#3: mask: 8
> cpu: hotplug: applying mask 0xf
> cpu: hotplug: applying mask 0xe
> cpu: hotplug: applying mask 0xd
> cpu: hotplug: applying mask 0xc
> cpu: hotplug: applying mask 0xb
> cpu: hotplug: applying mask 0xa
> ...
> cpu: hotplug: applying mask 0x1
> Traceback (most recent call last):
>   File "./tegra-tests/tests/cpu.py", line 159, in <module>
>     runner.standalone(module)
>   File "./tegra-tests/tests/runner.py", line 147, in standalone
>     log.test(log = log, args = args)
>   File "./tegra-tests/tests/cpu.py", line 29, in __call__
>     cpus.apply_mask(mask)
>   File "./tegra-tests/linux/system.py", line 149, in apply_mask
>     cpu.set_online(False)
>   File "./tegra-tests/linux/system.py", line 45, in set_online
>     self.online = online
> OSError: [Errno 16] Device or resource busy
>
> From looking at different runs it appears to fail at different places.
>
> Let me know if you have any thoughts.
>
> Thanks
> Jon
>
> [0] https://github.com/thierryreding/tegra-tests/blob/master/tests/cpu.py

Thanks for the report. Will take a further look into this problem and 
report back what I find.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 7/8] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
  2026-03-03 16:00     ` Waiman Long
@ 2026-03-03 22:48       ` Frederic Weisbecker
  2026-03-04  4:05         ` Waiman Long
  0 siblings, 1 reply; 29+ messages in thread
From: Frederic Weisbecker @ 2026-03-03 22:48 UTC (permalink / raw)
  To: Waiman Long
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Thomas Gleixner, Shuah Khan, cgroups, linux-kernel,
	linux-kselftest

Le Tue, Mar 03, 2026 at 11:00:54AM -0500, Waiman Long a écrit :
> On 2/26/26 11:06 AM, Frederic Weisbecker wrote:
> If you look at the work function, it will make a copy of HK_TYPE_DOMAIN
> cpumask while holding rcu_read_lock().

Where?

> So the current hotplug operation must
> have finished at that point.

I'm confused. This is called from sched_cpu_deactivate(), right?
So the work is scheduled at that point. But the work does cpuset_full_lock()
which includes cpu hotplug read lock, so the sched domain rebuild can only
happen at the end of cpu_down().

This means that between CPUHP_TEARDOWN_CPU and CPUHP_OFFLINE, the offline
CPU still appears in the scheduler topology because the scheduler domains
haven't been rebuilt.

And even if the work didn't take the cpu hotplug read lock, what guarantees that
it executes before reaching CPUHP_TEARDOWN_CPU?

> Of course, if there is another hot-add/remove
> operation right after the rcu_read_lock is released, the cpumask passed down
> to housekeeping_update() may not be the latest one. In this case, another
> work will be scheduled to call housekeeping_update() with the new cpumask
> again.

I'm not so much worried about housekeeping_update() (yet). I'm worried about
topology rebuild to happen before CPUHP_TEARDOWN_CPU. Offline CPUs shouldn't
exist in the topology.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 7/8] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
  2026-03-03 15:18   ` Jon Hunter
  2026-03-03 16:09     ` Waiman Long
@ 2026-03-04  3:58     ` Waiman Long
  2026-03-04 11:07       ` Jon Hunter
  1 sibling, 1 reply; 29+ messages in thread
From: Waiman Long @ 2026-03-04  3:58 UTC (permalink / raw)
  To: Jon Hunter, Chen Ridong, Tejun Heo, Johannes Weiner,
	Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest,
	linux-tegra@vger.kernel.org

On 3/3/26 10:18 AM, Jon Hunter wrote:
> Hi Waiman,
>
> On 21/02/2026 18:54, Waiman Long wrote:
>> The cpuset_handle_hotplug() may need to invoke housekeeping_update(),
>> for instance, when an isolated partition is invalidated because its
>> last active CPU has been put offline.
>>
>> As we are going to enable dynamic update to the nohz_full housekeeping
>> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
>> allowing the CPU hotplug path to call into housekeeping_update() 
>> directly
>> from update_isolation_cpumasks() will likely cause deadlock. So we
>> have to defer any call to housekeeping_update() until after the CPU hotplug
>> operation has finished. This is now done via the workqueue where
>> the update_hk_sched_domains() function will be invoked via the
>> hk_sd_workfn().
>>
>> A concurrent cpuset control file write may have executed the required
>> update_hk_sched_domains() function before the work function is 
>> called. So
>> the work function call may become a no-op when it is invoked.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   kernel/cgroup/cpuset.c                        | 31 ++++++++++++++++---
>>   .../selftests/cgroup/test_cpuset_prs.sh       | 11 ++++++-
>>   2 files changed, 36 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 3d0d18bf182f..2c80bfc30bbc 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -1323,6 +1323,16 @@ static void update_hk_sched_domains(void)
>>           rebuild_sched_domains_locked();
>>   }
>>   +/*
>> + * Work function to invoke update_hk_sched_domains()
>> + */
>> +static void hk_sd_workfn(struct work_struct *work)
>> +{
>> +    cpuset_full_lock();
>> +    update_hk_sched_domains();
>> +    cpuset_full_unlock();
>> +}
>> +
>>   /**
>>    * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
>>    * @parent: Parent cpuset containing all siblings
>> @@ -3795,6 +3805,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
>>    */
>>   static void cpuset_handle_hotplug(void)
>>   {
>> +    static DECLARE_WORK(hk_sd_work, hk_sd_workfn);
>>       static cpumask_t new_cpus;
>>       static nodemask_t new_mems;
>>       bool cpus_updated, mems_updated;
>> @@ -3877,11 +3888,21 @@ static void cpuset_handle_hotplug(void)
>>       }
>>     -    if (update_housekeeping || force_sd_rebuild) {
>> -        mutex_lock(&cpuset_mutex);
>> -        update_hk_sched_domains();
>> -        mutex_unlock(&cpuset_mutex);
>> -    }
>> +    /*
>> +     * Queue a work item to call housekeeping_update() and
>> +     * rebuild_sched_domains(). There will be a slight delay before the
>> +     * HK_TYPE_DOMAIN housekeeping cpumask correctly reflects what is
>> +     * in isolated_cpus.
>> +     *
>> +     * We rely on WORK_STRUCT_PENDING_BIT to avoid requeuing a work
>> +     * item that is still pending. Before the pending bit is cleared,
>> +     * the work data is copied out and the work item dequeued. So it is
>> +     * possible to queue the work again before hk_sd_workfn() is
>> +     * invoked to process the previously queued work. Since
>> +     * hk_sd_workfn() doesn't use the work item at all, this is not a
>> +     * problem.
>> +     */
>> +    if (update_housekeeping || force_sd_rebuild)
>> +        queue_work(system_unbound_wq, &hk_sd_work);
>> +
>>       free_tmpmasks(ptmp);
>>   }
>> diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
>> index 0c5db118f2d1..dc2dff361ec6 100755
>> --- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
>> +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
>> @@ -246,6 +246,9 @@ TEST_MATRIX=(
>>       "  C2-3:P1  C3:P1   .      .     O3=0    .      . .     0 A1:2|A2: A1:P1|A2:P1"
>>       "  C2-3:P1  C3:P1   .      .    T:O2=0   .      . .     0 A1:3|A2:3 A1:P1|A2:P-1"
>>       "  C2-3:P1  C3:P1   .      .      .    T:O3=0   . .     0 A1:2|A2:2 A1:P1|A2:P-1"
>> +    "  C2-3:P1  C3:P2   .      .    T:O2=0   .      . .     0 A1:3|A2:3 A1:P1|A2:P-2"
>> +    "  C1-3:P1  C3:P2   .      .      .    T:O3=0   . .     0 A1:1-2|A2:1-2 A1:P1|A2:P-2 3|"
>> +    "  C1-3:P1  C3:P2   .      .      .    T:O3=0  O3=1 .     0 A1:1-2|A2:3 A1:P1|A2:P2  3"
>>       "$SETUP_A123_PARTITIONS    .     O1=0    .      . .     0 A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
>>       "$SETUP_A123_PARTITIONS    .     O2=0    .      . .     0 A1:1|A2:|A3:3 A1:P1|A2:P1|A3:P1"
>>       "$SETUP_A123_PARTITIONS    .     O3=0    .      . .     0 A1:1|A2:2|A3: A1:P1|A2:P1|A3:P1"
>> @@ -762,7 +765,7 @@ check_cgroup_states()
>>   # only CPUs in isolated partitions as well as those that are isolated at
>>   # boot time.
>>   #
>> -# $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
>> +# $1 - expected isolated cpu list(s) <isolcpus1>{|<isolcpus2>}
>>   # <isolcpus1> - expected sched/domains value
>> # <isolcpus2> - cpuset.cpus.isolated value = <isolcpus1> if not defined
>>   #
>> @@ -771,6 +774,7 @@ check_isolcpus()
>>       EXPECTED_ISOLCPUS=$1
>>       ISCPUS=${CGROUP2}/cpuset.cpus.isolated
>>       ISOLCPUS=$(cat $ISCPUS)
>> +    HKICPUS=$(cat /sys/devices/system/cpu/isolated)
>>       LASTISOLCPU=
>>       SCHED_DOMAINS=/sys/kernel/debug/sched/domains
>>       if [[ $EXPECTED_ISOLCPUS = . ]]
>> @@ -808,6 +812,11 @@ check_isolcpus()
>>       ISOLCPUS=
>>       EXPECTED_ISOLCPUS=$EXPECTED_SDOMAIN
>>   +    #
>> +    # The inverse of HK_TYPE_DOMAIN cpumask in $HKICPUS should match $ISOLCPUS
>> +    #
>> +    [[ "$ISOLCPUS" != "$HKICPUS" ]] && return 1
>> +
>>       #
>>       # Use the sched domain in debugfs to check isolated CPUs, if available
>>       #
>
> We have a CPU hotplug test that cycles through all CPUs, off-lining
> and on-lining them in different combinations. Since this change was
> added to -next, this test has been failing on our Tegra210 boards.
> Bisecting the issue pointed to this commit, and reverting it on top
> of -next fixes the issue.
>
> The test is quite simple and part of Thierry's tegra-test suite [0].
>
> $ ./tegra-tests/tests/cpu.py --verbose hotplug
> cpu: hotplug: CPU#0: mask: 1
> cpu: hotplug: CPU#1: mask: 2
> cpu: hotplug: CPU#2: mask: 4
> cpu: hotplug: CPU#3: mask: 8
> cpu: hotplug: applying mask 0xf
> cpu: hotplug: applying mask 0xe
> cpu: hotplug: applying mask 0xd
> cpu: hotplug: applying mask 0xc
> cpu: hotplug: applying mask 0xb
> cpu: hotplug: applying mask 0xa
> ...
> cpu: hotplug: applying mask 0x1
> Traceback (most recent call last):
>   File "./tegra-tests/tests/cpu.py", line 159, in <module>
>     runner.standalone(module)
>   File "./tegra-tests/tests/runner.py", line 147, in standalone
>     log.test(log = log, args = args)
>   File "./tegra-tests/tests/cpu.py", line 29, in __call__
>     cpus.apply_mask(mask)
>   File "./tegra-tests/linux/system.py", line 149, in apply_mask
>     cpu.set_online(False)
>   File "./tegra-tests/linux/system.py", line 45, in set_online
>     self.online = online
> OSError: [Errno 16] Device or resource busy
>
> From looking at different runs it appears to fail at different places.
>
> Let me know if you have any thoughts.
>
> Thanks
> Jon
>
> [0] https://github.com/thierryreding/tegra-tests/blob/master/tests/cpu.py
>
It looks like -EBUSY was returned when the script tried to
online/offline a CPU. I ran a simple script that repeatedly did
offline/online operations and couldn't reproduce the problem. I don't
have access to the Tegra board that you use for testing. Would you mind
trying out the following patch to see if it gets rid of the problem?

Thanks,
Longman

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e200de7c60b6..5a5953fb391c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3936,8 +3936,10 @@ static void cpuset_handle_hotplug(void)
          * previously queued work. Since hk_sd_workfn() doesn't use the work
          * item at all, this is not a problem.
          */
-       if (update_housekeeping || force_sd_rebuild)
-               queue_work(system_unbound_wq, &hk_sd_work);
+       if (force_sd_rebuild)
+               rebuild_sched_domains_cpuslocked();
+       if (update_housekeeping)
+               queue_work(system_dfl_wq, &hk_sd_work);

         free_tmpmasks(ptmp);
  }



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 7/8] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
  2026-03-03 22:48       ` Frederic Weisbecker
@ 2026-03-04  4:05         ` Waiman Long
  0 siblings, 0 replies; 29+ messages in thread
From: Waiman Long @ 2026-03-04  4:05 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Thomas Gleixner, Shuah Khan, cgroups, linux-kernel,
	linux-kselftest

On 3/3/26 5:48 PM, Frederic Weisbecker wrote:
> Le Tue, Mar 03, 2026 at 11:00:54AM -0500, Waiman Long a écrit :
>> On 2/26/26 11:06 AM, Frederic Weisbecker wrote:
>> If you look at the work function, it will make a copy of HK_TYPE_DOMAIN
>> cpumask while holding rcu_read_lock().
> Where?
>
>> So the current hotplug operation must
>> have finished at that point.
> I'm confused. This is called from sched_cpu_deactivate(), right?
> So the work is scheduled at that point. But the work does cpuset_full_lock()
> which includes cpu hotplug read lock, so the sched domain rebuild can only
> happen at the end of cpu_down().
>
> This means that between CPUHP_TEARDOWN_CPU and CPUHP_OFFLINE, the offline
> CPU still appears in the scheduler topology because the scheduler domains
> haven't been rebuilt.
>
> And even if the work wouldn't cpu hotplug read lock, what guarantees that
> it executes before reaching CPUHP_TEARDOWN_CPU?
>
>> Of course, if there is another hot-add/remove
>> operation right after the rcu_read_lock is released, the cpumask passed down
>> to housekeeping_update() may not be the latest one. In this case, another
>> work will be scheduled to call housekeeping_update() with the new cpumask
>> again.
> I'm not so much worried about housekeeping_update() (yet). I'm worried
> about the topology rebuild happening before CPUHP_TEARDOWN_CPU.
> Offline CPUs shouldn't exist in the topology.

Yes, I am aware that this could be a problem. I am working on a fix
patch that will always call rebuild_sched_domains_cpuslocked() directly
in the hotplug path when needed, as shown in the patch that I sent to
Jon. I want to get confirmation first before I send it out. There will
be other minor code/comment adjustments as well.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 7/8] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
  2026-03-04  3:58     ` Waiman Long
@ 2026-03-04 11:07       ` Jon Hunter
  2026-03-04 18:11         ` Waiman Long
  0 siblings, 1 reply; 29+ messages in thread
From: Jon Hunter @ 2026-03-04 11:07 UTC (permalink / raw)
  To: Waiman Long, Chen Ridong, Tejun Heo, Johannes Weiner,
	Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest,
	linux-tegra@vger.kernel.org


On 04/03/2026 03:58, Waiman Long wrote:

...

> It looks like -EBUSY was returned when the script tried to
> online/offline a CPU. I ran a simple script that repeatedly did
> offline/online operations and couldn't reproduce the problem. I don't
> have access to the Tegra board that you use for testing. Would you
> mind trying out the following patch to see if it gets rid of the
> problem?
> 
> Thanks,
> Longman
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index e200de7c60b6..5a5953fb391c 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -3936,8 +3936,10 @@ static void cpuset_handle_hotplug(void)
>           * previously queued work. Since hk_sd_workfn() doesn't use the work
>           * item at all, this is not a problem.
>           */
> -       if (update_housekeeping || force_sd_rebuild)
> -               queue_work(system_unbound_wq, &hk_sd_work);
> +       if (force_sd_rebuild)
> +               rebuild_sched_domains_cpuslocked();
> +       if (update_housekeeping)
> +               queue_work(system_dfl_wq, &hk_sd_work);
> 
>          free_tmpmasks(ptmp);
>   }
> 
> 

Yes that did the trick. Works for me. Feel free to add my ...

Tested-by: Jon Hunter <jonathanh@nvidia.com>

Thanks
Jon

-- 
nvpublic


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 7/8] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
  2026-03-04 11:07       ` Jon Hunter
@ 2026-03-04 18:11         ` Waiman Long
  0 siblings, 0 replies; 29+ messages in thread
From: Waiman Long @ 2026-03-04 18:11 UTC (permalink / raw)
  To: Jon Hunter, Chen Ridong, Tejun Heo, Johannes Weiner,
	Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest,
	linux-tegra@vger.kernel.org


On 3/4/26 6:07 AM, Jon Hunter wrote:
>
> On 04/03/2026 03:58, Waiman Long wrote:
>
> ...
>
>> It looks like -EBUSY was returned when the script tried to
>> online/offline a CPU. I ran a simple script that repeatedly did
>> offline/online operations and couldn't reproduce the problem. I don't
>> have access to the Tegra board that you use for testing. Would you
>> mind trying out the following patch to see if it gets rid of the
>> problem?
>>
>> Thanks,
>> Longman
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index e200de7c60b6..5a5953fb391c 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -3936,8 +3936,10 @@ static void cpuset_handle_hotplug(void)
>>           * previously queued work. Since hk_sd_workfn() doesn't use the work
>>           * item at all, this is not a problem.
>>           */
>> -       if (update_housekeeping || force_sd_rebuild)
>> -               queue_work(system_unbound_wq, &hk_sd_work);
>> +       if (force_sd_rebuild)
>> +               rebuild_sched_domains_cpuslocked();
>> +       if (update_housekeeping)
>> +               queue_work(system_dfl_wq, &hk_sd_work);
>>
>>          free_tmpmasks(ptmp);
>>   }
>>
>>
>
> Yes that did the trick. Works for me. Feel free to add my ...
>
> Tested-by: Jon Hunter <jonathanh@nvidia.com> 

Thanks for the confirmation.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2026-03-04 18:11 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2026-02-21 18:54 [PATCH v6 0/8] cgroup/cpuset: Fix partition related locking issues Waiman Long
2026-02-21 18:54 ` [PATCH v6 1/8] cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del() Waiman Long
2026-02-21 18:54 ` [PATCH v6 2/8] cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in update_cpumasks_hier() Waiman Long
2026-02-21 18:54 ` [PATCH v6 3/8] cgroup/cpuset: Clarify exclusion rules for cpuset internal variables Waiman Long
2026-02-26 15:00   ` Frederic Weisbecker
2026-02-21 18:54 ` [PATCH v6 4/8] cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed Waiman Long
2026-02-26 15:07   ` Frederic Weisbecker
2026-02-21 18:54 ` [PATCH v6 5/8] kselftest/cgroup: Simplify test_cpuset_prs.sh by removing "S+" command Waiman Long
2026-02-21 18:54 ` [PATCH v6 6/8] cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together Waiman Long
2026-02-26 15:51   ` Frederic Weisbecker
2026-02-21 18:54 ` [PATCH v6 7/8] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue Waiman Long
2026-02-26 16:06   ` Frederic Weisbecker
2026-03-03 16:00     ` Waiman Long
2026-03-03 22:48       ` Frederic Weisbecker
2026-03-04  4:05         ` Waiman Long
2026-03-02 11:49   ` Frederic Weisbecker
2026-03-03 15:18   ` Jon Hunter
2026-03-03 16:09     ` Waiman Long
2026-03-04  3:58     ` Waiman Long
2026-03-04 11:07       ` Jon Hunter
2026-03-04 18:11         ` Waiman Long
2026-02-21 18:54 ` [PATCH v6 8/8] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock Waiman Long
2026-03-02 12:14   ` Frederic Weisbecker
2026-03-02 14:15     ` Waiman Long
2026-03-02 15:40       ` Waiman Long
2026-02-23 20:57 ` [PATCH v6 0/8] cgroup/cpuset: Fix partition related locking issues Tejun Heo
2026-02-23 21:11   ` Waiman Long
2026-02-24  7:51     ` Chen Ridong
2026-03-02 12:21   ` Frederic Weisbecker

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox