[PATCH/for-next v2 0/2] cgroup/cpuset: Fix partition related locking issues


* [PATCH/for-next v2 0/2] cgroup/cpuset: Fix partition related locking issues
@ 2026-01-30 15:42 Waiman Long
  2026-01-30 15:42 ` [PATCH/for-next v2 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to workqueue Waiman Long
  2026-01-30 15:42 ` [PATCH/for-next v2 2/2] cgroup/cpuset: Introduce a new top level cpuset_top_mutex Waiman Long
  0 siblings, 2 replies; 23+ messages in thread
From: Waiman Long @ 2026-01-30 15:42 UTC
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long

 v2:
  - Change patch 1 to use a workqueue instead of task_work, as it is a
    per-cpu kthread that performs the cpuset shutdown and bringup work.
  - Simplify and streamline some of the code.

After booting the latest cgroup for-next debug kernel with the latest
cgroup changes as well as Frederic's "cpuset/isolation: Honour kthreads
preferred affinity" patch series [1] merged on top and running the
test_cpuset_prs.sh test, a circular locking dependency lockdep splat
was reported. See patch 2 for details.

The following changes are made to resolve this locking problem:
 1) Defer the housekeeping_update() call from CPU hotplug to a workqueue
 2) Release cpus_read_lock before calling housekeeping_update()

With these changes, the cpuset test ran to completion with no failure
and no lockdep splat.

[1] https://lore.kernel.org/lkml/20260125224541.50226-1-frederic@kernel.org/

Waiman Long (2):
  cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to
    workqueue
  cgroup/cpuset: Introduce a new top level cpuset_top_mutex

 kernel/cgroup/cpuset.c                        | 124 ++++++++++++++----
 kernel/sched/isolation.c                      |   4 +-
 kernel/time/timer_migration.c                 |   3 +-
 .../selftests/cgroup/test_cpuset_prs.sh       |  13 +-
 4 files changed, 107 insertions(+), 37 deletions(-)

-- 
2.52.0



* [PATCH/for-next v2 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to workqueue
  2026-01-30 15:42 [PATCH/for-next v2 0/2] cgroup/cpuset: Fix partition related locking issues Waiman Long
@ 2026-01-30 15:42 ` Waiman Long
  2026-01-31  0:47   ` Chen Ridong
                     ` (2 more replies)
  2026-01-30 15:42 ` [PATCH/for-next v2 2/2] cgroup/cpuset: Introduce a new top level cpuset_top_mutex Waiman Long
  1 sibling, 3 replies; 23+ messages in thread
From: Waiman Long @ 2026-01-30 15:42 UTC
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long

The update_isolation_cpumasks() function can be called either directly
from regular cpuset control file write with cpuset_full_lock() called
or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held.

As we are going to enable dynamic update to the nohz_full housekeeping
cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
allowing the CPU hotplug path to call into housekeeping_update() directly
from update_isolation_cpumasks() will likely cause deadlock. So we
have to defer any call to housekeeping_update() until after the CPU
hotplug operation has finished. This is now done via a workqueue, where
the actual housekeeping_update() call, if needed, will happen after
cpus_write_lock is released.
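
To illustrate the deferral, here is a minimal single-threaded userspace
sketch in C. It is not kernel code and not what the patch compiles to:
every sim_* name is hypothetical, the workqueue is modeled as a single
pending flag (standing in for WORK_STRUCT_PENDING_BIT), and the full
lock is modeled as a plain boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
bool sim_masks_dirty;      /* plays the role of isolated_cpus_updating  */
bool sim_full_lock_held;   /* plays the role of cpuset_locked           */
bool sim_work_pending;     /* plays the role of WORK_STRUCT_PENDING_BIT */
int  sim_hk_updates;       /* counts simulated housekeeping_update()    */

void sim_housekeeping_update(void)
{
	sim_hk_updates++;
}

/* Deferred worker: runs later, outside the simulated hotplug section. */
void sim_workfn(void)
{
	sim_work_pending = false;
	sim_full_lock_held = true;	/* simulated cpuset_full_lock() */
	if (sim_masks_dirty) {
		sim_housekeeping_update();
		sim_masks_dirty = false;
	}
	sim_full_lock_held = false;	/* simulated cpuset_full_unlock() */
}

void sim_update_isolation_cpumasks(void)
{
	if (!sim_masks_dirty)
		return;

	if (!sim_full_lock_held) {
		/* Hotplug-like caller: queue once, queuing again is a no-op. */
		sim_work_pending = true;
		return;
	}

	/* Control-file-like caller: update synchronously. */
	sim_housekeeping_update();
	sim_masks_dirty = false;
}

/* Stand-in for the workqueue running the pending item, if any. */
void sim_drain_workqueue(void)
{
	if (sim_work_pending)
		sim_workfn();
}
```

In this model, two back-to-back calls from the hotplug-like path result
in a single deferred update once the queue is drained, while a caller
that holds the full lock updates immediately.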

We can't use the synchronous task_work API because calls from the CPU
hotplug path happen in the per-cpu kthread of the CPU that is being shut
down or brought up. Because of the asynchronous nature of the workqueue,
the HK_TYPE_DOMAIN housekeeping cpumask will be updated a bit later than
the "cpuset.cpus.isolated" control file in this case.

Also add a check in test_cpuset_prs.sh and modify some existing
test cases to confirm that "cpuset.cpus.isolated" and HK_TYPE_DOMAIN
housekeeping cpumask will both be updated.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c                        | 37 +++++++++++++++++--
 .../selftests/cgroup/test_cpuset_prs.sh       | 13 +++++--
 2 files changed, 44 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 7b7d12ab1006..0b0eb1df09d5 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -84,6 +84,9 @@ static cpumask_var_t	isolated_cpus;
  */
 static bool isolated_cpus_updating;
 
+/* Both cpuset_mutex and cpus_read_locked acquired */
+static bool cpuset_locked;
+
 /*
  * A flag to force sched domain rebuild at the end of an operation.
  * It can be set in
@@ -285,10 +288,12 @@ void cpuset_full_lock(void)
 {
 	cpus_read_lock();
 	mutex_lock(&cpuset_mutex);
+	cpuset_locked = true;
 }
 
 void cpuset_full_unlock(void)
 {
+	cpuset_locked = false;
 	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
 }
@@ -1285,6 +1290,16 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
 	return false;
 }
 
+static void isolcpus_workfn(struct work_struct *work)
+{
+	cpuset_full_lock();
+	if (isolated_cpus_updating) {
+		WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
+		isolated_cpus_updating = false;
+	}
+	cpuset_full_unlock();
+}
+
 /*
  * update_isolation_cpumasks - Update external isolation related CPU masks
  *
@@ -1293,14 +1308,30 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
  */
 static void update_isolation_cpumasks(void)
 {
-	int ret;
+	static DECLARE_WORK(isolcpus_work, isolcpus_workfn);
 
 	if (!isolated_cpus_updating)
 		return;
 
-	ret = housekeeping_update(isolated_cpus);
-	WARN_ON_ONCE(ret < 0);
+	/*
+	 * This function can be reached either directly from regular cpuset
+	 * control file write (cpuset_locked) or via hotplug (cpus_write_lock
+	 * && cpuset_mutex held). In the latter case, we defer the
+	 * housekeeping_update() call to the system_unbound_wq to avoid the
+	 * possibility of deadlock. This also means that there will be a short
+	 * period of time where HK_TYPE_DOMAIN housekeeping cpumask will lag
+	 * behind isolated_cpus.
+	 */
+	if (!cpuset_locked) {
+		/*
+		 * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work
+		 * item that is still pending.
+		 */
+		queue_work(system_unbound_wq, &isolcpus_work);
+		return;
+	}
 
+	WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
 	isolated_cpus_updating = false;
 }
 
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index 5dff3ad53867..0502b156582b 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -245,8 +245,9 @@ TEST_MATRIX=(
 	"C2-3:P1:S+  C3:P2  .      .     O2=0   O2=1    .      .     0 A1:2|A2:3 A1:P1|A2:P2"
 	"C2-3:P1:S+  C3:P1  .      .     O2=0    .      .      .     0 A1:|A2:3 A1:P1|A2:P1"
 	"C2-3:P1:S+  C3:P1  .      .     O3=0    .      .      .     0 A1:2|A2: A1:P1|A2:P1"
-	"C2-3:P1:S+  C3:P1  .      .    T:O2=0   .      .      .     0 A1:3|A2:3 A1:P1|A2:P-1"
-	"C2-3:P1:S+  C3:P1  .      .      .    T:O3=0   .      .     0 A1:2|A2:2 A1:P1|A2:P-1"
+	"C2-3:P1:S+  C3:P2  .      .    T:O2=0   .      .      .     0 A1:3|A2:3 A1:P1|A2:P-2"
+	"C1-3:P1:S+  C3:P2  .      .      .    T:O3=0   .      .     0 A1:1-2|A2:1-2 A1:P1|A2:P-2 3|"
+	"C1-3:P1:S+  C3:P2  .      .      .    T:O3=0  O3=1    .     0 A1:1-2|A2:3 A1:P1|A2:P2  3"
 	"$SETUP_A123_PARTITIONS    .     O1=0    .      .      .     0 A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
 	"$SETUP_A123_PARTITIONS    .     O2=0    .      .      .     0 A1:1|A2:|A3:3 A1:P1|A2:P1|A3:P1"
 	"$SETUP_A123_PARTITIONS    .     O3=0    .      .      .     0 A1:1|A2:2|A3: A1:P1|A2:P1|A3:P1"
@@ -764,7 +765,7 @@ check_cgroup_states()
 # only CPUs in isolated partitions as well as those that are isolated at
 # boot time.
 #
-# $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
+# $1 - expected isolated cpu list(s) <isolcpus1>{|<isolcpus2>}
 # <isolcpus1> - expected sched/domains value
 # <isolcpus2> - cpuset.cpus.isolated value = <isolcpus1> if not defined
 #
@@ -773,6 +774,7 @@ check_isolcpus()
 	EXPECTED_ISOLCPUS=$1
 	ISCPUS=${CGROUP2}/cpuset.cpus.isolated
 	ISOLCPUS=$(cat $ISCPUS)
+	HKICPUS=$(cat /sys/devices/system/cpu/isolated)
 	LASTISOLCPU=
 	SCHED_DOMAINS=/sys/kernel/debug/sched/domains
 	if [[ $EXPECTED_ISOLCPUS = . ]]
@@ -810,6 +812,11 @@ check_isolcpus()
 	ISOLCPUS=
 	EXPECTED_ISOLCPUS=$EXPECTED_SDOMAIN
 
+	#
+	# The inverse of HK_TYPE_DOMAIN cpumask in $HKICPUS should match $ISOLCPUS
+	#
+	[[ "$ISOLCPUS" != "$HKICPUS" ]] && return 1
+
 	#
 	# Use the sched domain in debugfs to check isolated CPUs, if available
 	#
-- 
2.52.0



* [PATCH/for-next v2 2/2] cgroup/cpuset: Introduce a new top level cpuset_top_mutex
  2026-01-30 15:42 [PATCH/for-next v2 0/2] cgroup/cpuset: Fix partition related locking issues Waiman Long
  2026-01-30 15:42 ` [PATCH/for-next v2 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to workqueue Waiman Long
@ 2026-01-30 15:42 ` Waiman Long
  2026-01-31  2:53   ` Chen Ridong
  1 sibling, 1 reply; 23+ messages in thread
From: Waiman Long @ 2026-01-30 15:42 UTC
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long

The current cpuset partition code is able to dynamically update
the sched domains of a running system and the corresponding
HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
"isolcpus=domain,..." boot command line feature at run time.

The housekeeping cpumask update requires flushing a number of different
workqueues, which may not be safe with cpus_read_lock() held as the
workqueue flushing code may acquire cpus_read_lock() or acquire locks
that have a locking dependency with cpus_read_lock() down the chain.
Below is an example of such a circular locking problem.

  ======================================================
  WARNING: possible circular locking dependency detected
  6.18.0-test+ #2 Tainted: G S
  ------------------------------------------------------
  test_cpuset_prs/10971 is trying to acquire lock:
  ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180

  but task is already holding lock:
  ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:
  -> #4 (cpuset_mutex){+.+.}-{4:4}:
  -> #3 (cpu_hotplug_lock){++++}-{0:0}:
  -> #2 (rtnl_mutex){+.+.}-{4:4}:
  -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
  -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:

  Chain exists of:
    (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex

  5 locks held by test_cpuset_prs/10971:
   #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
   #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
   #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
   #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
   #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130

  Call Trace:
   <TASK>
     :
   touch_wq_lockdep_map+0x93/0x180
   __flush_workqueue+0x111/0x10b0
   housekeeping_update+0x12d/0x2d0
   update_parent_effective_cpumask+0x595/0x2440
   update_prstate+0x89d/0xce0
   cpuset_partition_write+0xc5/0x130
   cgroup_file_write+0x1a5/0x680
   kernfs_fop_write_iter+0x3df/0x5f0
   vfs_write+0x525/0xfd0
   ksys_write+0xf9/0x1d0
   do_syscall_64+0x95/0x520
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

To avoid such a circular locking dependency problem, we have to
call housekeeping_update() without holding the cpus_read_lock() and
cpuset_mutex. The current set of wq's flushed by housekeeping_update()
may not have work functions that call cpus_read_lock() directly,
but we are likely to extend the list of wq's that are flushed in the
future. Moreover, the current set of work functions may hold locks that
may have cpu_hotplug_lock down the dependency chain.

One way to do that is to introduce a new top level cpuset_top_mutex
which will be acquired first.  This new cpuset_top_mutex will provide
the needed mutual exclusion without the need to hold cpus_read_lock().

As cpus_read_lock() is now no longer held when
tmigr_isolated_exclude_cpumask() is called, it needs to acquire it
directly.

lockdep_is_cpuset_held() is also updated to check the new
cpuset_top_mutex.
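
As a rough illustration of the ordering described above, the following
userspace sketch in C models the three sleeping locks as booleans. It is
only a sketch under stated assumptions: all sim_* names are hypothetical,
the ordering check in sim_lock() is a crude stand-in for lockdep, and
sim_housekeeping_update() merely records whether the inner locks were
still held when it ran.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the locks, listed in acquisition order. */
enum { SIM_TOP_MUTEX, SIM_HOTPLUG_LOCK, SIM_CPUSET_MUTEX, SIM_NR_LOCKS };

bool sim_held[SIM_NR_LOCKS];
bool sim_inner_held_during_update = true;	/* overwritten by the update */

void sim_lock(int id)
{
	/* Ordering rule: neither this lock nor a later one may be held. */
	for (int i = id; i < SIM_NR_LOCKS; i++)
		assert(!sim_held[i]);
	sim_held[id] = true;
}

void sim_unlock(int id)
{
	assert(sim_held[id]);
	sim_held[id] = false;
}

void sim_partial_lock(void)
{
	sim_lock(SIM_HOTPLUG_LOCK);
	sim_lock(SIM_CPUSET_MUTEX);
}

void sim_partial_unlock(void)
{
	sim_unlock(SIM_CPUSET_MUTEX);
	sim_unlock(SIM_HOTPLUG_LOCK);
}

void sim_full_lock(void)
{
	sim_lock(SIM_TOP_MUTEX);
	sim_partial_lock();
}

void sim_full_unlock(void)
{
	sim_partial_unlock();
	sim_unlock(SIM_TOP_MUTEX);
}

/* Stand-in for the workqueue-flushing update; records the lock state. */
void sim_housekeeping_update(void)
{
	sim_inner_held_during_update =
		sim_held[SIM_HOTPLUG_LOCK] || sim_held[SIM_CPUSET_MUTEX];
}

void sim_update_masks(void)
{
	assert(sim_held[SIM_TOP_MUTEX]);	/* top mutex stays held */
	sim_partial_unlock();			/* drop the inner locks */
	sim_housekeeping_update();
	sim_partial_lock();			/* re-acquire, same order */
}
```

The point of the model is that the top mutex alone covers the window in
which the inner locks are dropped, so the caller returns with the same
set of locks held as when it entered.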

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c        | 101 +++++++++++++++++++++++-----------
 kernel/sched/isolation.c      |   4 +-
 kernel/time/timer_migration.c |   3 +-
 3 files changed, 70 insertions(+), 38 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 0b0eb1df09d5..edccfa2df9da 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -78,13 +78,13 @@ static cpumask_var_t	subpartitions_cpus;
 static cpumask_var_t	isolated_cpus;
 
 /*
- * isolated_cpus updating flag (protected by cpuset_mutex)
+ * isolated_cpus updating flag (protected by cpuset_top_mutex)
  * Set if isolated_cpus is going to be updated in the current
  * cpuset_mutex crtical section.
  */
 static bool isolated_cpus_updating;
 
-/* Both cpuset_mutex and cpus_read_locked acquired */
+/* cpuset_top_mutex acquired */
 static bool cpuset_locked;
 
 /*
@@ -222,29 +222,44 @@ struct cpuset top_cpuset = {
 };
 
 /*
- * There are two global locks guarding cpuset structures - cpuset_mutex and
- * callback_lock. The cpuset code uses only cpuset_mutex. Other kernel
- * subsystems can use cpuset_lock()/cpuset_unlock() to prevent change to cpuset
- * structures. Note that cpuset_mutex needs to be a mutex as it is used in
- * paths that rely on priority inheritance (e.g. scheduler - on RT) for
- * correctness.
+ * CPUSET Locking Convention
+ * -------------------------
  *
- * A task must hold both locks to modify cpusets.  If a task holds
- * cpuset_mutex, it blocks others, ensuring that it is the only task able to
- * also acquire callback_lock and be able to modify cpusets.  It can perform
- * various checks on the cpuset structure first, knowing nothing will change.
- * It can also allocate memory while just holding cpuset_mutex.  While it is
- * performing these checks, various callback routines can briefly acquire
- * callback_lock to query cpusets.  Once it is ready to make the changes, it
- * takes callback_lock, blocking everyone else.
+ * Below are the four global locks guarding cpuset structures in lock
+ * acquisition order:
+ *  - cpuset_top_mutex
+ *  - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
+ *  - cpuset_mutex
+ *  - callback_lock (raw spinlock)
  *
- * Calls to the kernel memory allocator can not be made while holding
- * callback_lock, as that would risk double tripping on callback_lock
- * from one of the callbacks into the cpuset code from within
- * __alloc_pages().
+ * The first cpuset_top_mutex will be held except when calling into
+ * cpuset_handle_hotplug() from the CPU hotplug code where cpus_write_lock
+ * and cpuset_mutex will be held instead.
  *
- * If a task is only holding callback_lock, then it has read-only
- * access to cpusets.
+ * As cpuset will now indirectly flush a number of different workqueues in
+ * housekeeping_update() when the set of isolated CPUs is going to be changed,
+ * it may not be safe from the circular locking perspective to hold the
+ * cpus_read_lock. So cpus_read_lock and cpuset_mutex will be released before
+ * calling housekeeping_update() and re-acquired afterward.
+ *
+ * A task must hold all the remaining three locks to modify externally visible
+ * or used fields of cpusets, though some of the internally used cpuset fields
+ * can be modified without holding callback_lock. If only reliable read access
+ * of the externally used fields are needed, a task can hold either
+ * cpuset_mutex or callback_lock which are exposed to other subsystems.
+ *
+ * If a task holds cpu_hotplug_lock and cpuset_mutex, it blocks others,
+ * ensuring that it is the only task able to also acquire callback_lock and
+ * be able to modify cpusets.  It can perform various checks on the cpuset
+ * structure first, knowing nothing will change. It can also allocate memory
+ * without holding callback_lock. While it is performing these checks, various
+ * callback routines can briefly acquire callback_lock to query cpusets.  Once
+ * it is ready to make the changes, it takes callback_lock, blocking everyone
+ * else.
+ *
+ * Calls to the kernel memory allocator cannot be made while holding
+ * callback_lock which is a spinlock, as the memory allocator may sleep or
+ * call back into cpuset code and acquire callback_lock.
  *
  * Now, the task_struct fields mems_allowed and mempolicy may be changed
  * by other task, we use alloc_lock in the task_struct fields to protect
@@ -255,6 +270,7 @@ struct cpuset top_cpuset = {
  * cpumasks and nodemasks.
  */
 
+static DEFINE_MUTEX(cpuset_top_mutex);
 static DEFINE_MUTEX(cpuset_mutex);
 
 /**
@@ -278,6 +294,18 @@ void lockdep_assert_cpuset_lock_held(void)
 	lockdep_assert_held(&cpuset_mutex);
 }
 
+static void cpuset_partial_lock(void)
+{
+	cpus_read_lock();
+	mutex_lock(&cpuset_mutex);
+}
+
+static void cpuset_partial_unlock(void)
+{
+	mutex_unlock(&cpuset_mutex);
+	cpus_read_unlock();
+}
+
 /**
  * cpuset_full_lock - Acquire full protection for cpuset modification
  *
@@ -286,22 +314,22 @@ void lockdep_assert_cpuset_lock_held(void)
  */
 void cpuset_full_lock(void)
 {
-	cpus_read_lock();
-	mutex_lock(&cpuset_mutex);
+	mutex_lock(&cpuset_top_mutex);
+	cpuset_partial_lock();
 	cpuset_locked = true;
 }
 
 void cpuset_full_unlock(void)
 {
 	cpuset_locked = false;
-	mutex_unlock(&cpuset_mutex);
-	cpus_read_unlock();
+	cpuset_partial_unlock();
+	mutex_unlock(&cpuset_top_mutex);
 }
 
 #ifdef CONFIG_LOCKDEP
 bool lockdep_is_cpuset_held(void)
 {
-	return lockdep_is_held(&cpuset_mutex);
+	return lockdep_is_held(&cpuset_top_mutex);
 }
 #endif
 
@@ -1292,12 +1320,12 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
 
 static void isolcpus_workfn(struct work_struct *work)
 {
-	cpuset_full_lock();
-	if (isolated_cpus_updating) {
-		WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
-		isolated_cpus_updating = false;
-	}
-	cpuset_full_unlock();
+	guard(mutex)(&cpuset_top_mutex);
+	if (!isolated_cpus_updating)
+		return;
+
+	WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
+	isolated_cpus_updating = false;
 }
 
 /*
@@ -1331,8 +1359,15 @@ static void update_isolation_cpumasks(void)
 		return;
 	}
 
+	lockdep_assert_held(&cpuset_top_mutex);
+	/*
+	 * Release cpus_read_lock & cpuset_mutex before calling
+	 * housekeeping_update() and re-acquiring them afterward.
+	 */
+	cpuset_partial_unlock();
 	WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
 	isolated_cpus_updating = false;
+	cpuset_partial_lock();
 }
 
 /**
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 3b725d39c06e..ef152d401fe2 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -123,8 +123,6 @@ int housekeeping_update(struct cpumask *isol_mask)
 	struct cpumask *trial, *old = NULL;
 	int err;
 
-	lockdep_assert_cpus_held();
-
 	trial = kmalloc(cpumask_size(), GFP_KERNEL);
 	if (!trial)
 		return -ENOMEM;
@@ -136,7 +134,7 @@ int housekeeping_update(struct cpumask *isol_mask)
 	}
 
 	if (!housekeeping.flags)
-		static_branch_enable_cpuslocked(&housekeeping_overridden);
+		static_branch_enable(&housekeeping_overridden);
 
 	if (housekeeping.flags & HK_FLAG_DOMAIN)
 		old = housekeeping_cpumask_dereference(HK_TYPE_DOMAIN);
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 6da9cd562b20..244a8d025e78 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1559,8 +1559,6 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
 	cpumask_var_t cpumask __free(free_cpumask_var) = CPUMASK_VAR_NULL;
 	int cpu;
 
-	lockdep_assert_cpus_held();
-
 	if (!works)
 		return -ENOMEM;
 	if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
@@ -1570,6 +1568,7 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
 	 * First set previously isolated CPUs as available (unisolate).
 	 * This cpumask contains only CPUs that switched to available now.
 	 */
+	guard(cpus_read_lock)();
 	cpumask_andnot(cpumask, cpu_online_mask, exclude_cpumask);
 	cpumask_andnot(cpumask, cpumask, tmigr_available_cpumask);
 
-- 
2.52.0



* Re: [PATCH/for-next v2 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to workqueue
  2026-01-30 15:42 ` [PATCH/for-next v2 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to workqueue Waiman Long
@ 2026-01-31  0:47   ` Chen Ridong
  2026-01-31  1:06     ` Waiman Long
  2026-01-31  0:58   ` Chen Ridong
  2026-02-02 13:05   ` Peter Zijlstra
  2 siblings, 1 reply; 23+ messages in thread
From: Chen Ridong @ 2026-01-31  0:47 UTC
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest



On 2026/1/30 23:42, Waiman Long wrote:
> The update_isolation_cpumasks() function can be called either directly
> from regular cpuset control file write with cpuset_full_lock() called
> or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held.
> 
> As we are going to enable dynamic update to the nohz_full housekeeping
> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
> allowing the CPU hotplug path to call into housekeeping_update() directly
> from update_isolation_cpumasks() will likely cause deadlock. So we
> have to defer any call to housekeeping_update() after the CPU hotplug
> operation has finished. This is now done via the workqueue where
> the actual housekeeping_update() call, if needed, will happen after
> cpus_write_lock is released.
> 
> We can't use the synchronous task_work API as call from CPU hotplug
> path happen in the per-cpu kthread of the CPU that is being shut down
> or brought up. Because of the asynchronous nature of workqueue, the
> HK_TYPE_DOMAIN housekeeping cpumask will be updated a bit later than the
> "cpuset.cpus.isolated" control file in this case.
> 
> Also add a check in test_cpuset_prs.sh and modify some existing
> test cases to confirm that "cpuset.cpus.isolated" and HK_TYPE_DOMAIN
> housekeeping cpumask will both be updated.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/cgroup/cpuset.c                        | 37 +++++++++++++++++--
>  .../selftests/cgroup/test_cpuset_prs.sh       | 13 +++++--
>  2 files changed, 44 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 7b7d12ab1006..0b0eb1df09d5 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -84,6 +84,9 @@ static cpumask_var_t	isolated_cpus;
>   */
>  static bool isolated_cpus_updating;
>  
> +/* Both cpuset_mutex and cpus_read_locked acquired */
> +static bool cpuset_locked;
> +
>  /*
>   * A flag to force sched domain rebuild at the end of an operation.
>   * It can be set in
> @@ -285,10 +288,12 @@ void cpuset_full_lock(void)
>  {
>  	cpus_read_lock();
>  	mutex_lock(&cpuset_mutex);
> +	cpuset_locked = true;
>  }
>  
>  void cpuset_full_unlock(void)
>  {
> +	cpuset_locked = false;
>  	mutex_unlock(&cpuset_mutex);
>  	cpus_read_unlock();
>  }
> @@ -1285,6 +1290,16 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
>  	return false;
>  }
>  
> +static void isolcpus_workfn(struct work_struct *work)
> +{
> +	cpuset_full_lock();
> +	if (isolated_cpus_updating) {
> +		WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
> +		isolated_cpus_updating = false;
> +	}
> +	cpuset_full_unlock();
> +}
> +
>  /*
>   * update_isolation_cpumasks - Update external isolation related CPU masks
>   *
> @@ -1293,14 +1308,30 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
>   */
>  static void update_isolation_cpumasks(void)
>  {
> -	int ret;
> +	static DECLARE_WORK(isolcpus_work, isolcpus_workfn);
>  
>  	if (!isolated_cpus_updating)
>  		return;
>  

Can this happen?

cpu0					cpu1
[...]

isolated_cpus_updating = true;
...
// 'full_lock' is not acquired
update_isolation_cpumasks
					// exec worker concurrently
					isolcpus_workfn
					cpuset_full_lock
					isolated_cpus_updating = false;
					cpuset_full_unlock();
// This returns incorrectly
if (!isolated_cpus_updating)
	return;

> -	ret = housekeeping_update(isolated_cpus);
> -	WARN_ON_ONCE(ret < 0);
> +	/*
> +	 * This function can be reached either directly from regular cpuset
> +	 * control file write (cpuset_locked) or via hotplug (cpus_write_lock
> +	 * && cpuset_mutex held). In the latter case, we defer the
> +	 * housekeeping_update() call to the system_unbound_wq to avoid the
> +	 * possibility of deadlock. This also means that there will be a short
> +	 * period of time where HK_TYPE_DOMAIN housekeeping cpumask will lag
> +	 * behind isolated_cpus.
> +	 */
> +	if (!cpuset_locked) {
> +		/*
> +		 * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work
> +		 * item that is still pending.
> +		 */
> +		queue_work(system_unbound_wq, &isolcpus_work);
> +		return;
> +	}
>  
> +	WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
>  	isolated_cpus_updating = false;
>  }
>  
> diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> index 5dff3ad53867..0502b156582b 100755
> --- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> @@ -245,8 +245,9 @@ TEST_MATRIX=(
>  	"C2-3:P1:S+  C3:P2  .      .     O2=0   O2=1    .      .     0 A1:2|A2:3 A1:P1|A2:P2"
>  	"C2-3:P1:S+  C3:P1  .      .     O2=0    .      .      .     0 A1:|A2:3 A1:P1|A2:P1"
>  	"C2-3:P1:S+  C3:P1  .      .     O3=0    .      .      .     0 A1:2|A2: A1:P1|A2:P1"
> -	"C2-3:P1:S+  C3:P1  .      .    T:O2=0   .      .      .     0 A1:3|A2:3 A1:P1|A2:P-1"
> -	"C2-3:P1:S+  C3:P1  .      .      .    T:O3=0   .      .     0 A1:2|A2:2 A1:P1|A2:P-1"
> +	"C2-3:P1:S+  C3:P2  .      .    T:O2=0   .      .      .     0 A1:3|A2:3 A1:P1|A2:P-2"
> +	"C1-3:P1:S+  C3:P2  .      .      .    T:O3=0   .      .     0 A1:1-2|A2:1-2 A1:P1|A2:P-2 3|"
> +	"C1-3:P1:S+  C3:P2  .      .      .    T:O3=0  O3=1    .     0 A1:1-2|A2:3 A1:P1|A2:P2  3"
>  	"$SETUP_A123_PARTITIONS    .     O1=0    .      .      .     0 A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
>  	"$SETUP_A123_PARTITIONS    .     O2=0    .      .      .     0 A1:1|A2:|A3:3 A1:P1|A2:P1|A3:P1"
>  	"$SETUP_A123_PARTITIONS    .     O3=0    .      .      .     0 A1:1|A2:2|A3: A1:P1|A2:P1|A3:P1"
> @@ -764,7 +765,7 @@ check_cgroup_states()
>  # only CPUs in isolated partitions as well as those that are isolated at
>  # boot time.
>  #
> -# $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
> +# $1 - expected isolated cpu list(s) <isolcpus1>{|<isolcpus2>}
>  # <isolcpus1> - expected sched/domains value
>  # <isolcpus2> - cpuset.cpus.isolated value = <isolcpus1> if not defined
>  #
> @@ -773,6 +774,7 @@ check_isolcpus()
>  	EXPECTED_ISOLCPUS=$1
>  	ISCPUS=${CGROUP2}/cpuset.cpus.isolated
>  	ISOLCPUS=$(cat $ISCPUS)
> +	HKICPUS=$(cat /sys/devices/system/cpu/isolated)
>  	LASTISOLCPU=
>  	SCHED_DOMAINS=/sys/kernel/debug/sched/domains
>  	if [[ $EXPECTED_ISOLCPUS = . ]]
> @@ -810,6 +812,11 @@ check_isolcpus()
>  	ISOLCPUS=
>  	EXPECTED_ISOLCPUS=$EXPECTED_SDOMAIN
>  
> +	#
> +	# The inverse of HK_TYPE_DOMAIN cpumask in $HKICPUS should match $ISOLCPUS
> +	#
> +	[[ "$ISOLCPUS" != "$HKICPUS" ]] && return 1
> +
>  	#
>  	# Use the sched domain in debugfs to check isolated CPUs, if available
>  	#

-- 
Best regards,
Ridong



* Re: [PATCH/for-next v2 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to workqueue
  2026-01-30 15:42 ` [PATCH/for-next v2 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to workqueue Waiman Long
  2026-01-31  0:47   ` Chen Ridong
@ 2026-01-31  0:58   ` Chen Ridong
  2026-01-31  1:45     ` Waiman Long
  2026-02-02 13:05   ` Peter Zijlstra
  2 siblings, 1 reply; 23+ messages in thread
From: Chen Ridong @ 2026-01-31  0:58 UTC
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest



On 2026/1/30 23:42, Waiman Long wrote:
> The update_isolation_cpumasks() function can be called either directly
> from regular cpuset control file write with cpuset_full_lock() called
> or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held.
> 
> As we are going to enable dynamic update to the nohz_full housekeeping
> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
> allowing the CPU hotplug path to call into housekeeping_update() directly
> from update_isolation_cpumasks() will likely cause deadlock. So we
> have to defer any call to housekeeping_update() after the CPU hotplug
> operation has finished. This is now done via the workqueue where
> the actual housekeeping_update() call, if needed, will happen after
> cpus_write_lock is released.
> 
> We can't use the synchronous task_work API as call from CPU hotplug
> path happen in the per-cpu kthread of the CPU that is being shut down
> or brought up. Because of the asynchronous nature of workqueue, the
> HK_TYPE_DOMAIN housekeeping cpumask will be updated a bit later than the
> "cpuset.cpus.isolated" control file in this case.
> 
> Also add a check in test_cpuset_prs.sh and modify some existing
> test cases to confirm that "cpuset.cpus.isolated" and HK_TYPE_DOMAIN
> housekeeping cpumask will both be updated.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/cgroup/cpuset.c                        | 37 +++++++++++++++++--
>  .../selftests/cgroup/test_cpuset_prs.sh       | 13 +++++--
>  2 files changed, 44 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 7b7d12ab1006..0b0eb1df09d5 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -84,6 +84,9 @@ static cpumask_var_t	isolated_cpus;
>   */
>  static bool isolated_cpus_updating;
>  
> +/* Both cpuset_mutex and cpus_read_locked acquired */
> +static bool cpuset_locked;
> +
>  /*
>   * A flag to force sched domain rebuild at the end of an operation.
>   * It can be set in
> @@ -285,10 +288,12 @@ void cpuset_full_lock(void)
>  {
>  	cpus_read_lock();
>  	mutex_lock(&cpuset_mutex);
> +	cpuset_locked = true;
>  }
>  
>  void cpuset_full_unlock(void)
>  {
> +	cpuset_locked = false;
>  	mutex_unlock(&cpuset_mutex);
>  	cpus_read_unlock();
>  }
> @@ -1285,6 +1290,16 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
>  	return false;
>  }
>  
> +static void isolcpus_workfn(struct work_struct *work)
> +{
> +	cpuset_full_lock();
> +	if (isolated_cpus_updating) {
> +		WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
> +		isolated_cpus_updating = false;
> +	}
> +	cpuset_full_unlock();
> +}
> +
>  /*
>   * update_isolation_cpumasks - Update external isolation related CPU masks
>   *
> @@ -1293,14 +1308,30 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
>   */
>  static void update_isolation_cpumasks(void)
>  {
> -	int ret;
> +	static DECLARE_WORK(isolcpus_work, isolcpus_workfn);
>  
>  	if (!isolated_cpus_updating)
>  		return;
>  
> -	ret = housekeeping_update(isolated_cpus);
> -	WARN_ON_ONCE(ret < 0);
> +	/*
> +	 * This function can be reached either directly from regular cpuset
> +	 * control file write (cpuset_locked) or via hotplug (cpus_write_lock
> +	 * && cpuset_mutex held). In the latter case, we defer the
> +	 * housekeeping_update() call to the system_unbound_wq to avoid the
> +	 * possibility of deadlock. This also means that there will be a short
> +	 * period of time where HK_TYPE_DOMAIN housekeeping cpumask will lag
> +	 * behind isolated_cpus.
> +	 */
> +	if (!cpuset_locked) {

Adding a global variable makes this difficult to handle, especially in
concurrent scenarios, since we could read it outside of a critical region.

I suggest removing cpuset_locked and adding async_update_isolation_cpumasks
instead, which can indicate to the caller it should call without holding the
full lock.
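
One possible reading of this suggestion, as a minimal userspace sketch
rather than kernel code: the caller picks the entry point explicitly, so
no global flag has to be consulted. Only the control flow is modelled, with
no real locks or workqueue; update_isolation_cpumasks_sync(),
isolcpus_workfn_model(), work_pending and hk_updates are hypothetical names
used for illustration only.

```c
#include <stdbool.h>

/* Hypothetical model state; none of these are the kernel's variables. */
static bool isolated_cpus_updating;	/* an update is outstanding */
static bool work_pending;		/* stands in for a queued work item */
static int hk_updates;			/* counts modelled housekeeping_update() calls */

/* For callers that hold the full lock: apply the update right away. */
static void update_isolation_cpumasks_sync(void)
{
	if (!isolated_cpus_updating)
		return;
	hk_updates++;
	isolated_cpus_updating = false;
}

/* For hotplug-style callers: only record that the work must run later. */
static void async_update_isolation_cpumasks(void)
{
	if (!isolated_cpus_updating)
		return;
	work_pending = true;	/* marking already pending work again is a no-op */
}

/* Deferred worker: runs later and applies whatever is still outstanding. */
static void isolcpus_workfn_model(void)
{
	work_pending = false;
	update_isolation_cpumasks_sync();
}
```

The trade-off is that every call chain ending in the update has to know
which variant applies to it.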

> +		/*
> +		 * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work
> +		 * item that is still pending.
> +		 */
> +		queue_work(system_unbound_wq, &isolcpus_work);
> +		return;
> +	}
>  
> +	WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
>  	isolated_cpus_updating = false;
>  }
>  
> diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> index 5dff3ad53867..0502b156582b 100755
> --- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> @@ -245,8 +245,9 @@ TEST_MATRIX=(
>  	"C2-3:P1:S+  C3:P2  .      .     O2=0   O2=1    .      .     0 A1:2|A2:3 A1:P1|A2:P2"
>  	"C2-3:P1:S+  C3:P1  .      .     O2=0    .      .      .     0 A1:|A2:3 A1:P1|A2:P1"
>  	"C2-3:P1:S+  C3:P1  .      .     O3=0    .      .      .     0 A1:2|A2: A1:P1|A2:P1"
> -	"C2-3:P1:S+  C3:P1  .      .    T:O2=0   .      .      .     0 A1:3|A2:3 A1:P1|A2:P-1"
> -	"C2-3:P1:S+  C3:P1  .      .      .    T:O3=0   .      .     0 A1:2|A2:2 A1:P1|A2:P-1"
> +	"C2-3:P1:S+  C3:P2  .      .    T:O2=0   .      .      .     0 A1:3|A2:3 A1:P1|A2:P-2"
> +	"C1-3:P1:S+  C3:P2  .      .      .    T:O3=0   .      .     0 A1:1-2|A2:1-2 A1:P1|A2:P-2 3|"
> +	"C1-3:P1:S+  C3:P2  .      .      .    T:O3=0  O3=1    .     0 A1:1-2|A2:3 A1:P1|A2:P2  3"
>  	"$SETUP_A123_PARTITIONS    .     O1=0    .      .      .     0 A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
>  	"$SETUP_A123_PARTITIONS    .     O2=0    .      .      .     0 A1:1|A2:|A3:3 A1:P1|A2:P1|A3:P1"
>  	"$SETUP_A123_PARTITIONS    .     O3=0    .      .      .     0 A1:1|A2:2|A3: A1:P1|A2:P1|A3:P1"
> @@ -764,7 +765,7 @@ check_cgroup_states()
>  # only CPUs in isolated partitions as well as those that are isolated at
>  # boot time.
>  #
> -# $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
> +# $1 - expected isolated cpu list(s) <isolcpus1>{|<isolcpus2>}
>  # <isolcpus1> - expected sched/domains value
>  # <isolcpus2> - cpuset.cpus.isolated value = <isolcpus1> if not defined
>  #
> @@ -773,6 +774,7 @@ check_isolcpus()
>  	EXPECTED_ISOLCPUS=$1
>  	ISCPUS=${CGROUP2}/cpuset.cpus.isolated
>  	ISOLCPUS=$(cat $ISCPUS)
> +	HKICPUS=$(cat /sys/devices/system/cpu/isolated)
>  	LASTISOLCPU=
>  	SCHED_DOMAINS=/sys/kernel/debug/sched/domains
>  	if [[ $EXPECTED_ISOLCPUS = . ]]
> @@ -810,6 +812,11 @@ check_isolcpus()
>  	ISOLCPUS=
>  	EXPECTED_ISOLCPUS=$EXPECTED_SDOMAIN
>  
> +	#
> +	# The inverse of HK_TYPE_DOMAIN cpumask in $HKICPUS should match $ISOLCPUS
> +	#
> +	[[ "$ISOLCPUS" != "$HKICPUS" ]] && return 1
> +
>  	#
>  	# Use the sched domain in debugfs to check isolated CPUs, if available
>  	#

-- 
Best regards,
Ridong



* Re: [PATCH/for-next v2 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to workqueue
  2026-01-31  0:47   ` Chen Ridong
@ 2026-01-31  1:06     ` Waiman Long
  2026-01-31  1:43       ` Chen Ridong
  0 siblings, 1 reply; 23+ messages in thread
From: Waiman Long @ 2026-01-31  1:06 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest


On 1/30/26 7:47 PM, Chen Ridong wrote:
>
> On 2026/1/30 23:42, Waiman Long wrote:
>> The update_isolation_cpumasks() function can be called either directly
>> from regular cpuset control file write with cpuset_full_lock() called
>> or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held.
Note this statement.
>>
>> As we are going to enable dynamic update to the nohz_full housekeeping
>> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
>> allowing the CPU hotplug path to call into housekeeping_update() directly
>> from update_isolation_cpumasks() will likely cause deadlock. So we
>> have to defer any call to housekeeping_update() after the CPU hotplug
>> operation has finished. This is now done via the workqueue where
>> the actual housekeeping_update() call, if needed, will happen after
>> cpus_write_lock is released.
>>
>> We can't use the synchronous task_work API as calls from the CPU hotplug
>> path happen in the per-cpu kthread of the CPU that is being shut down
>> or brought up. Because of the asynchronous nature of workqueue, the
>> HK_TYPE_DOMAIN housekeeping cpumask will be updated a bit later than the
>> "cpuset.cpus.isolated" control file in this case.
>>
>> Also add a check in test_cpuset_prs.sh and modify some existing
>> test cases to confirm that "cpuset.cpus.isolated" and HK_TYPE_DOMAIN
>> housekeeping cpumask will both be updated.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   kernel/cgroup/cpuset.c                        | 37 +++++++++++++++++--
>>   .../selftests/cgroup/test_cpuset_prs.sh       | 13 +++++--
>>   2 files changed, 44 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 7b7d12ab1006..0b0eb1df09d5 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -84,6 +84,9 @@ static cpumask_var_t	isolated_cpus;
>>    */
>>   static bool isolated_cpus_updating;
>>   
>> +/* Both cpuset_mutex and cpus_read_locked acquired */
>> +static bool cpuset_locked;
>> +
>>   /*
>>    * A flag to force sched domain rebuild at the end of an operation.
>>    * It can be set in
>> @@ -285,10 +288,12 @@ void cpuset_full_lock(void)
>>   {
>>   	cpus_read_lock();
>>   	mutex_lock(&cpuset_mutex);
>> +	cpuset_locked = true;
>>   }
>>   
>>   void cpuset_full_unlock(void)
>>   {
>> +	cpuset_locked = false;
>>   	mutex_unlock(&cpuset_mutex);
>>   	cpus_read_unlock();
>>   }
>> @@ -1285,6 +1290,16 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
>>   	return false;
>>   }
>>   
>> +static void isolcpus_workfn(struct work_struct *work)
>> +{
>> +	cpuset_full_lock();
>> +	if (isolated_cpus_updating) {
>> +		WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
>> +		isolated_cpus_updating = false;
>> +	}
>> +	cpuset_full_unlock();
>> +}
>> +
>>   /*
>>    * update_isolation_cpumasks - Update external isolation related CPU masks
>>    *
>> @@ -1293,14 +1308,30 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
>>    */
>>   static void update_isolation_cpumasks(void)
>>   {
>> -	int ret;
>> +	static DECLARE_WORK(isolcpus_work, isolcpus_workfn);
>>   
>>   	if (!isolated_cpus_updating)
>>   		return;
>>   
> Can this happen?
>
> cpu0					cpu1
> [...]
>
> isolated_cpus_updating = true;
> ...
> // 'full_lock' is not acquired
> update_isolation_cpumasks
That is not true. Either cpus_read_lock or cpus_write_lock and 
cpuset_mutex are held when update_isolation_cpumasks() is called. So 
there is mutual exclusion.
> 					// exec worker concurrently
> 					isolcpus_workfn
> 					cpuset_full_lock
> 					isolated_cpus_updating = false;
> 					cpuset_full_unlock();
> // This returns incorrectly
> if (!isolated_cpus_updating)
> 	return;
>
Cheers,
Longman



* Re: [PATCH/for-next v2 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to workqueue
  2026-01-31  1:06     ` Waiman Long
@ 2026-01-31  1:43       ` Chen Ridong
  2026-01-31  1:49         ` Chen Ridong
  0 siblings, 1 reply; 23+ messages in thread
From: Chen Ridong @ 2026-01-31  1:43 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest



On 2026/1/31 9:06, Waiman Long wrote:
> 
> On 1/30/26 7:47 PM, Chen Ridong wrote:
>>
>> On 2026/1/30 23:42, Waiman Long wrote:
>>> The update_isolation_cpumasks() function can be called either directly
>>> from regular cpuset control file write with cpuset_full_lock() called
>>> or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held.
> Note this statement.

Thank you for the reminder.

>>>
>>> As we are going to enable dynamic update to the nohz_full housekeeping
>>> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
>>> allowing the CPU hotplug path to call into housekeeping_update() directly
>>> from update_isolation_cpumasks() will likely cause deadlock. So we
>>> have to defer any call to housekeeping_update() after the CPU hotplug
>>> operation has finished. This is now done via the workqueue where
>>> the actual housekeeping_update() call, if needed, will happen after
>>> cpus_write_lock is released.
>>>
>>> We can't use the synchronous task_work API as calls from the CPU hotplug
>>> path happen in the per-cpu kthread of the CPU that is being shut down
>>> or brought up. Because of the asynchronous nature of workqueue, the
>>> HK_TYPE_DOMAIN housekeeping cpumask will be updated a bit later than the
>>> "cpuset.cpus.isolated" control file in this case.
>>>
>>> Also add a check in test_cpuset_prs.sh and modify some existing
>>> test cases to confirm that "cpuset.cpus.isolated" and HK_TYPE_DOMAIN
>>> housekeeping cpumask will both be updated.
>>>
>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>> ---
>>>   kernel/cgroup/cpuset.c                        | 37 +++++++++++++++++--
>>>   .../selftests/cgroup/test_cpuset_prs.sh       | 13 +++++--
>>>   2 files changed, 44 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>>> index 7b7d12ab1006..0b0eb1df09d5 100644
>>> --- a/kernel/cgroup/cpuset.c
>>> +++ b/kernel/cgroup/cpuset.c
>>> @@ -84,6 +84,9 @@ static cpumask_var_t    isolated_cpus;
>>>    */
>>>   static bool isolated_cpus_updating;
>>>   +/* Both cpuset_mutex and cpus_read_locked acquired */
>>> +static bool cpuset_locked;
>>> +
>>>   /*
>>>    * A flag to force sched domain rebuild at the end of an operation.
>>>    * It can be set in
>>> @@ -285,10 +288,12 @@ void cpuset_full_lock(void)
>>>   {
>>>       cpus_read_lock();
>>>       mutex_lock(&cpuset_mutex);
>>> +    cpuset_locked = true;
>>>   }
>>>     void cpuset_full_unlock(void)
>>>   {
>>> +    cpuset_locked = false;
>>>       mutex_unlock(&cpuset_mutex);
>>>       cpus_read_unlock();
>>>   }
>>> @@ -1285,6 +1290,16 @@ static bool prstate_housekeeping_conflict(int prstate,
>>> struct cpumask *new_cpus)
>>>       return false;
>>>   }
>>>   +static void isolcpus_workfn(struct work_struct *work)
>>> +{
>>> +    cpuset_full_lock();
>>> +    if (isolated_cpus_updating) {
>>> +        WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
>>> +        isolated_cpus_updating = false;
>>> +    }
>>> +    cpuset_full_unlock();
>>> +}
>>> +
>>>   /*
>>>    * update_isolation_cpumasks - Update external isolation related CPU masks
>>>    *
>>> @@ -1293,14 +1308,30 @@ static bool prstate_housekeeping_conflict(int
>>> prstate, struct cpumask *new_cpus)
>>>    */
>>>   static void update_isolation_cpumasks(void)
>>>   {
>>> -    int ret;
>>> +    static DECLARE_WORK(isolcpus_work, isolcpus_workfn);
>>>         if (!isolated_cpus_updating)
>>>           return;
>>>   
>> Can this happen?
>>
>> cpu0                    cpu1
>> [...]
>>
>> isolated_cpus_updating = true;
>> ...
>> // 'full_lock' is not acquired
>> update_isolation_cpumasks
> That is not true. Either cpus_read_lock or cpus_write_lock and cpuset_mutex are
> held when update_isolation_cpumasks() is called. So there is mutual exclusion.

We currently assume that it can only be called from the existing call sites, so
it's okay for now. But I'm concerned that if we later call
update_isolation_cpumasks() without realizing that we need to hold either
cpus_write_lock or (cpus_read_lock && cpuset_mutex), we could run into
concurrency issues. Maybe I'm worrying too much.

Maybe we should add a lockdep_assert_held() check inside update_isolation_cpumasks().
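
To illustrate the idea, here is a rough userspace analogue of a
lock-held assertion, assuming a pthread mutex wrapped with an owner
record. It is only a sketch of the pattern, not the kernel's lockdep
machinery; struct tracked_mutex, tracked_is_held() and the *_model names
are hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a mutex whose ownership can be checked. */
struct tracked_mutex {
	pthread_mutex_t lock;
	pthread_t owner;	/* meaningful only while held is true */
	bool held;
};

static struct tracked_mutex cpuset_mutex_model = {
	.lock = PTHREAD_MUTEX_INITIALIZER,
};

static void tracked_lock(struct tracked_mutex *m)
{
	pthread_mutex_lock(&m->lock);
	m->owner = pthread_self();
	m->held = true;
}

static void tracked_unlock(struct tracked_mutex *m)
{
	m->held = false;
	pthread_mutex_unlock(&m->lock);
}

/*
 * Debug aid in the spirit of lockdep_assert_held(): true only if the
 * calling thread recorded itself as owner. The unlocked read is a
 * simplification that a real implementation would have to avoid.
 */
static bool tracked_is_held(struct tracked_mutex *m)
{
	return m->held && pthread_equal(m->owner, pthread_self());
}

static bool isolated_cpus_updating_model;
static int updates_applied;

/* Refuses to run (returns false) unless the caller holds the mutex. */
static bool update_isolation_cpumasks_model(void)
{
	if (!tracked_is_held(&cpuset_mutex_model))
		return false;
	if (isolated_cpus_updating_model) {
		updates_applied++;
		isolated_cpus_updating_model = false;
	}
	return true;
}
```

A future caller that forgets the lock is then caught at the call site
instead of silently racing with the worker.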

>>                     // exec worker concurrently
>>                     isolcpus_workfn
>>                     cpuset_full_lock
>>                     isolated_cpus_updating = false;
>>                     cpuset_full_unlock();
>> // This returns incorrectly
>> if (!isolated_cpus_updating)
>>     return;
>>
> Cheers,
> Longman
> 

-- 
Best regards,
Ridong



* Re: [PATCH/for-next v2 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to workqueue
  2026-01-31  0:58   ` Chen Ridong
@ 2026-01-31  1:45     ` Waiman Long
  2026-01-31  2:05       ` Chen Ridong
  0 siblings, 1 reply; 23+ messages in thread
From: Waiman Long @ 2026-01-31  1:45 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest

On 1/30/26 7:58 PM, Chen Ridong wrote:
>
> On 2026/1/30 23:42, Waiman Long wrote:
>> The update_isolation_cpumasks() function can be called either directly
>> from regular cpuset control file write with cpuset_full_lock() called
>> or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held.
>>
>> As we are going to enable dynamic update to the nohz_full housekeeping
>> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
>> allowing the CPU hotplug path to call into housekeeping_update() directly
>> from update_isolation_cpumasks() will likely cause deadlock. So we
>> have to defer any call to housekeeping_update() after the CPU hotplug
>> operation has finished. This is now done via the workqueue where
>> the actual housekeeping_update() call, if needed, will happen after
>> cpus_write_lock is released.
>>
>> We can't use the synchronous task_work API as calls from the CPU hotplug
>> path happen in the per-cpu kthread of the CPU that is being shut down
>> or brought up. Because of the asynchronous nature of workqueue, the
>> HK_TYPE_DOMAIN housekeeping cpumask will be updated a bit later than the
>> "cpuset.cpus.isolated" control file in this case.
>>
>> Also add a check in test_cpuset_prs.sh and modify some existing
>> test cases to confirm that "cpuset.cpus.isolated" and HK_TYPE_DOMAIN
>> housekeeping cpumask will both be updated.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   kernel/cgroup/cpuset.c                        | 37 +++++++++++++++++--
>>   .../selftests/cgroup/test_cpuset_prs.sh       | 13 +++++--
>>   2 files changed, 44 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 7b7d12ab1006..0b0eb1df09d5 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -84,6 +84,9 @@ static cpumask_var_t	isolated_cpus;
>>    */
>>   static bool isolated_cpus_updating;
>>   
>> +/* Both cpuset_mutex and cpus_read_locked acquired */
>> +static bool cpuset_locked;
>> +
>>   /*
>>    * A flag to force sched domain rebuild at the end of an operation.
>>    * It can be set in
>> @@ -285,10 +288,12 @@ void cpuset_full_lock(void)
>>   {
>>   	cpus_read_lock();
>>   	mutex_lock(&cpuset_mutex);
>> +	cpuset_locked = true;
>>   }
>>   
>>   void cpuset_full_unlock(void)
>>   {
>> +	cpuset_locked = false;
>>   	mutex_unlock(&cpuset_mutex);
>>   	cpus_read_unlock();
>>   }
>> @@ -1285,6 +1290,16 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
>>   	return false;
>>   }
>>   
>> +static void isolcpus_workfn(struct work_struct *work)
>> +{
>> +	cpuset_full_lock();
>> +	if (isolated_cpus_updating) {
>> +		WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
>> +		isolated_cpus_updating = false;
>> +	}
>> +	cpuset_full_unlock();
>> +}
>> +
>>   /*
>>    * update_isolation_cpumasks - Update external isolation related CPU masks
>>    *
>> @@ -1293,14 +1308,30 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
>>    */
>>   static void update_isolation_cpumasks(void)
>>   {
>> -	int ret;
>> +	static DECLARE_WORK(isolcpus_work, isolcpus_workfn);
>>   
>>   	if (!isolated_cpus_updating)
>>   		return;
>>   
>> -	ret = housekeeping_update(isolated_cpus);
>> -	WARN_ON_ONCE(ret < 0);
>> +	/*
>> +	 * This function can be reached either directly from regular cpuset
>> +	 * control file write (cpuset_locked) or via hotplug (cpus_write_lock
>> +	 * && cpuset_mutex held). In the latter case, we defer the
>> +	 * housekeeping_update() call to the system_unbound_wq to avoid the
>> +	 * possibility of deadlock. This also means that there will be a short
>> +	 * period of time where HK_TYPE_DOMAIN housekeeping cpumask will lag
>> +	 * behind isolated_cpus.
>> +	 */
>> +	if (!cpuset_locked) {
> Adding a global variable makes this difficult to handle, especially in
> concurrent scenarios, since we could read it outside of a critical region.
No, cpuset_locked is always read from or written into inside a critical 
section. It is under cpuset_mutex up to this point and then with the 
cpuset_top_mutex with the next patch.
>
> I suggest removing cpuset_locked and adding async_update_isolation_cpumasks
> instead, which can indicate to the caller it should call without holding the
> full lock.

The point of this global variable is to distinguish between calls from the
CPU hotplug path and the other regular cpuset code paths. The only difference
between the two is whether cpus_read_lock or cpus_write_lock is held.
That is why I think setting a global variable in cpuset_full_lock() is
the easy way. Otherwise, we will have to add an extra argument to some of the
functions to distinguish these two cases.
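
As a minimal userspace model of the flag-based approach described above,
the sketch below mirrors only the decision logic of the patch. There are
no real locks or workqueues here; the *_model names, work_pending and
hk_updates are hypothetical stand-ins.

```c
#include <stdbool.h>

static bool cpuset_locked;		/* set only by full_lock_model() */
static bool isolated_cpus_updating;	/* an update is outstanding */
static bool work_pending;		/* stands in for WORK_STRUCT_PENDING_BIT */
static int hk_updates;			/* counts modelled housekeeping_update() calls */

/* Models cpuset_full_lock()/cpuset_full_unlock() setting the flag. */
static void full_lock_model(void)
{
	cpuset_locked = true;
}

static void full_unlock_model(void)
{
	cpuset_locked = false;
}

/* Deferred worker: takes the full lock, then applies a pending update. */
static void isolcpus_workfn_model(void)
{
	work_pending = false;
	full_lock_model();
	if (isolated_cpus_updating) {
		hk_updates++;
		isolated_cpus_updating = false;
	}
	full_unlock_model();
}

/* Same entry point for both paths; the flag selects sync or deferred. */
static void update_isolation_cpumasks_flag(void)
{
	if (!isolated_cpus_updating)
		return;
	if (!cpuset_locked) {
		/* Hotplug-like caller: defer; re-queueing pending work is a no-op. */
		work_pending = true;
		return;
	}
	hk_updates++;
	isolated_cpus_updating = false;
}
```

In this model a hotplug-like caller leaves isolated_cpus_updating set
until the worker runs, which corresponds to the short lag mentioned in
the patch comment.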

Cheers,
Longman



* Re: [PATCH/for-next v2 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to workqueue
  2026-01-31  1:43       ` Chen Ridong
@ 2026-01-31  1:49         ` Chen Ridong
  0 siblings, 0 replies; 23+ messages in thread
From: Chen Ridong @ 2026-01-31  1:49 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest



On 2026/1/31 9:43, Chen Ridong wrote:
> 
> 
> On 2026/1/31 9:06, Waiman Long wrote:
>>
>> On 1/30/26 7:47 PM, Chen Ridong wrote:
>>>
>>> On 2026/1/30 23:42, Waiman Long wrote:
>>>> The update_isolation_cpumasks() function can be called either directly
>>>> from regular cpuset control file write with cpuset_full_lock() called
>>>> or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held.
>> Note this statement.
> 
> Thank you for the reminder.
> 
>>>>
>>>> As we are going to enable dynamic update to the nohz_full housekeeping
>>>> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
>>>> allowing the CPU hotplug path to call into housekeeping_update() directly
>>>> from update_isolation_cpumasks() will likely cause deadlock. So we
>>>> have to defer any call to housekeeping_update() after the CPU hotplug
>>>> operation has finished. This is now done via the workqueue where
>>>> the actual housekeeping_update() call, if needed, will happen after
>>>> cpus_write_lock is released.
>>>>
>>>> We can't use the synchronous task_work API as calls from the CPU hotplug
>>>> path happen in the per-cpu kthread of the CPU that is being shut down
>>>> or brought up. Because of the asynchronous nature of workqueue, the
>>>> HK_TYPE_DOMAIN housekeeping cpumask will be updated a bit later than the
>>>> "cpuset.cpus.isolated" control file in this case.
>>>>
>>>> Also add a check in test_cpuset_prs.sh and modify some existing
>>>> test cases to confirm that "cpuset.cpus.isolated" and HK_TYPE_DOMAIN
>>>> housekeeping cpumask will both be updated.
>>>>
>>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>>> ---
>>>>   kernel/cgroup/cpuset.c                        | 37 +++++++++++++++++--
>>>>   .../selftests/cgroup/test_cpuset_prs.sh       | 13 +++++--
>>>>   2 files changed, 44 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>>>> index 7b7d12ab1006..0b0eb1df09d5 100644
>>>> --- a/kernel/cgroup/cpuset.c
>>>> +++ b/kernel/cgroup/cpuset.c
>>>> @@ -84,6 +84,9 @@ static cpumask_var_t    isolated_cpus;
>>>>    */
>>>>   static bool isolated_cpus_updating;
>>>>   +/* Both cpuset_mutex and cpus_read_locked acquired */
>>>> +static bool cpuset_locked;
>>>> +
>>>>   /*
>>>>    * A flag to force sched domain rebuild at the end of an operation.
>>>>    * It can be set in
>>>> @@ -285,10 +288,12 @@ void cpuset_full_lock(void)
>>>>   {
>>>>       cpus_read_lock();
>>>>       mutex_lock(&cpuset_mutex);
>>>> +    cpuset_locked = true;
>>>>   }
>>>>     void cpuset_full_unlock(void)
>>>>   {
>>>> +    cpuset_locked = false;
>>>>       mutex_unlock(&cpuset_mutex);
>>>>       cpus_read_unlock();
>>>>   }
>>>> @@ -1285,6 +1290,16 @@ static bool prstate_housekeeping_conflict(int prstate,
>>>> struct cpumask *new_cpus)
>>>>       return false;
>>>>   }
>>>>   +static void isolcpus_workfn(struct work_struct *work)
>>>> +{
>>>> +    cpuset_full_lock();
>>>> +    if (isolated_cpus_updating) {
>>>> +        WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
>>>> +        isolated_cpus_updating = false;
>>>> +    }
>>>> +    cpuset_full_unlock();
>>>> +}
>>>> +
>>>>   /*
>>>>    * update_isolation_cpumasks - Update external isolation related CPU masks
>>>>    *
>>>> @@ -1293,14 +1308,30 @@ static bool prstate_housekeeping_conflict(int
>>>> prstate, struct cpumask *new_cpus)
>>>>    */
>>>>   static void update_isolation_cpumasks(void)
>>>>   {
>>>> -    int ret;
>>>> +    static DECLARE_WORK(isolcpus_work, isolcpus_workfn);
>>>>         if (!isolated_cpus_updating)
>>>>           return;
>>>>   
>>> Can this happen?
>>>
>>> cpu0                    cpu1
>>> [...]
>>>
>>> isolated_cpus_updating = true;
>>> ...
>>> // 'full_lock' is not acquired
>>> update_isolation_cpumasks
>> That is not true. Either cpus_read_lock or cpus_write_lock and cpuset_mutex are
>> held when update_isolation_cpumasks() is called. So there is mutual exclusion.
> 
> We currently assume that it can only be called from the existing call sites, so
> it's okay for now. But I'm concerned that if we later call
> update_isolation_cpumasks() without realizing that we need to hold either
> cpus_write_lock or (cpus_read_lock && cpuset_mutex), we could run into
> concurrency issues. Maybe I'm worrying too much.
> 
> Maybe we should add a lockdep_assert_held() check inside update_isolation_cpumasks().
> 

I saw in patch 2/2 that isolated_cpus_updating is described as "protected by
cpuset_top_mutex." This could be a bit ambiguous: the caller needs to hold either
cpus_read_lock or cpus_write_lock, plus cpuset_mutex, to protect
isolated_cpus_updating.

>>>                     // exec worker concurrently
>>>                     isolcpus_workfn
>>>                     cpuset_full_lock
>>>                     isolated_cpus_updating = false;
>>>                     cpuset_full_unlock();
>>> // This returns incorrectly
>>> if (!isolated_cpus_updating)
>>>     return;
>>>
>> Cheers,
>> Longman
>>
> 

-- 
Best regards,
Ridong



* Re: [PATCH/for-next v2 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to workqueue
  2026-01-31  1:45     ` Waiman Long
@ 2026-01-31  2:05       ` Chen Ridong
  2026-01-31 23:00         ` Waiman Long
  0 siblings, 1 reply; 23+ messages in thread
From: Chen Ridong @ 2026-01-31  2:05 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest



On 2026/1/31 9:45, Waiman Long wrote:
> On 1/30/26 7:58 PM, Chen Ridong wrote:
>>
>> On 2026/1/30 23:42, Waiman Long wrote:
>>> The update_isolation_cpumasks() function can be called either directly
>>> from regular cpuset control file write with cpuset_full_lock() called
>>> or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held.
>>>
>>> As we are going to enable dynamic update to the nohz_full housekeeping
>>> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
>>> allowing the CPU hotplug path to call into housekeeping_update() directly
>>> from update_isolation_cpumasks() will likely cause deadlock. So we
>>> have to defer any call to housekeeping_update() after the CPU hotplug
>>> operation has finished. This is now done via the workqueue where
>>> the actual housekeeping_update() call, if needed, will happen after
>>> cpus_write_lock is released.
>>>
>>> We can't use the synchronous task_work API as calls from the CPU hotplug
>>> path happen in the per-cpu kthread of the CPU that is being shut down
>>> or brought up. Because of the asynchronous nature of workqueue, the
>>> HK_TYPE_DOMAIN housekeeping cpumask will be updated a bit later than the
>>> "cpuset.cpus.isolated" control file in this case.
>>>
>>> Also add a check in test_cpuset_prs.sh and modify some existing
>>> test cases to confirm that "cpuset.cpus.isolated" and HK_TYPE_DOMAIN
>>> housekeeping cpumask will both be updated.
>>>
>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>> ---
>>>   kernel/cgroup/cpuset.c                        | 37 +++++++++++++++++--
>>>   .../selftests/cgroup/test_cpuset_prs.sh       | 13 +++++--
>>>   2 files changed, 44 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>>> index 7b7d12ab1006..0b0eb1df09d5 100644
>>> --- a/kernel/cgroup/cpuset.c
>>> +++ b/kernel/cgroup/cpuset.c
>>> @@ -84,6 +84,9 @@ static cpumask_var_t    isolated_cpus;
>>>    */
>>>   static bool isolated_cpus_updating;
>>>   +/* Both cpuset_mutex and cpus_read_locked acquired */
>>> +static bool cpuset_locked;
>>> +
>>>   /*
>>>    * A flag to force sched domain rebuild at the end of an operation.
>>>    * It can be set in
>>> @@ -285,10 +288,12 @@ void cpuset_full_lock(void)
>>>   {
>>>       cpus_read_lock();
>>>       mutex_lock(&cpuset_mutex);
>>> +    cpuset_locked = true;
>>>   }
>>>     void cpuset_full_unlock(void)
>>>   {
>>> +    cpuset_locked = false;
>>>       mutex_unlock(&cpuset_mutex);
>>>       cpus_read_unlock();
>>>   }
>>> @@ -1285,6 +1290,16 @@ static bool prstate_housekeeping_conflict(int prstate,
>>> struct cpumask *new_cpus)
>>>       return false;
>>>   }
>>>   +static void isolcpus_workfn(struct work_struct *work)
>>> +{
>>> +    cpuset_full_lock();
>>> +    if (isolated_cpus_updating) {
>>> +        WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
>>> +        isolated_cpus_updating = false;
>>> +    }
>>> +    cpuset_full_unlock();
>>> +}
>>> +
>>>   /*
>>>    * update_isolation_cpumasks - Update external isolation related CPU masks
>>>    *
>>> @@ -1293,14 +1308,30 @@ static bool prstate_housekeeping_conflict(int
>>> prstate, struct cpumask *new_cpus)
>>>    */
>>>   static void update_isolation_cpumasks(void)
>>>   {
>>> -    int ret;
>>> +    static DECLARE_WORK(isolcpus_work, isolcpus_workfn);
>>>         if (!isolated_cpus_updating)
>>>           return;
>>>   -    ret = housekeeping_update(isolated_cpus);
>>> -    WARN_ON_ONCE(ret < 0);
>>> +    /*
>>> +     * This function can be reached either directly from regular cpuset
>>> +     * control file write (cpuset_locked) or via hotplug (cpus_write_lock
>>> +     * && cpuset_mutex held). In the latter case, we defer the
>>> +     * housekeeping_update() call to the system_unbound_wq to avoid the
>>> +     * possibility of deadlock. This also means that there will be a short
>>> +     * period of time where HK_TYPE_DOMAIN housekeeping cpumask will lag
>>> +     * behind isolated_cpus.
>>> +     */
>>> +    if (!cpuset_locked) {
>> Adding a global variable makes this difficult to handle, especially in
>> concurrent scenarios, since we could read it outside of a critical region.
> No, cpuset_locked is always read from or written into inside a critical section.
> It is under cpuset_mutex up to this point and then with the cpuset_top_mutex
> with the next patch.

This is somewhat confusing. cpuset_locked is only set to true when the "full
lock" has been acquired. If cpuset_locked is false, that should mean we are
outside of any critical region. Conversely, if we are inside a critical region,
cpuset_locked should be true.

The situation is a bit messy; it's not clear which lock protects which global
variable.

>>
>> I suggest removing cpuset_locked and adding async_update_isolation_cpumasks
>> instead, which can indicate to the caller it should call without holding the
>> full lock.
> 
> The point of this global variable is to distinguish between calls from the CPU
> hotplug path and the other regular cpuset code paths. The only difference between
> the two is whether cpus_read_lock or cpus_write_lock is held. That is why I think
> setting a global variable in cpuset_full_lock() is the easy way. Otherwise, we
> will have to add an extra argument to some of the functions to distinguish these
> two cases.
> 
> Cheers,
> Longman
> 

-- 
Best regards,
Ridong



* Re: [PATCH/for-next v2 2/2] cgroup/cpuset: Introduce a new top level cpuset_top_mutex
  2026-01-30 15:42 ` [PATCH/for-next v2 2/2] cgroup/cpuset: Introduce a new top level cpuset_top_mutex Waiman Long
@ 2026-01-31  2:53   ` Chen Ridong
  2026-01-31 23:13     ` Waiman Long
  0 siblings, 1 reply; 23+ messages in thread
From: Chen Ridong @ 2026-01-31  2:53 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest



On 2026/1/30 23:42, Waiman Long wrote:
> The current cpuset partition code is able to dynamically update
> the sched domains of a running system and the corresponding
> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentally the
> "isolcpus=domain,..." boot command line feature at run time.
> 
> The housekeeping cpumask update requires flushing a number of different
> workqueues which may not be safe with cpus_read_lock() held as the
> workqueue flushing code may acquire cpus_read_lock() or acquiring locks
> which have locking dependency with cpus_read_lock() down the chain. Below
> is an example of such circular locking problem.
> 
>   ======================================================
>   WARNING: possible circular locking dependency detected
>   6.18.0-test+ #2 Tainted: G S
>   ------------------------------------------------------
>   test_cpuset_prs/10971 is trying to acquire lock:
>   ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180
> 
>   but task is already holding lock:
>   ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
> 
>   which lock already depends on the new lock.
> 
>   the existing dependency chain (in reverse order) is:
>   -> #4 (cpuset_mutex){+.+.}-{4:4}:
>   -> #3 (cpu_hotplug_lock){++++}-{0:0}:
>   -> #2 (rtnl_mutex){+.+.}-{4:4}:
>   -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
>   -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
> 
>   Chain exists of:
>     (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex
> 
>   5 locks held by test_cpuset_prs/10971:
>    #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
>    #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
>    #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
>    #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
>    #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
> 
>   Call Trace:
>    <TASK>
>      :
>    touch_wq_lockdep_map+0x93/0x180
>    __flush_workqueue+0x111/0x10b0
>    housekeeping_update+0x12d/0x2d0
>    update_parent_effective_cpumask+0x595/0x2440
>    update_prstate+0x89d/0xce0
>    cpuset_partition_write+0xc5/0x130
>    cgroup_file_write+0x1a5/0x680
>    kernfs_fop_write_iter+0x3df/0x5f0
>    vfs_write+0x525/0xfd0
>    ksys_write+0xf9/0x1d0
>    do_syscall_64+0x95/0x520
>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> To avoid such a circular locking dependency problem, we have to
> call housekeeping_update() without holding the cpus_read_lock() and
> cpuset_mutex. The current set of wq's flushed by housekeeping_update()
> may not have work functions that call cpus_read_lock() directly,
> but we are likely to extend the list of wq's that are flushed in the
> future. Moreover, the current set of work functions may hold locks that
> may have cpu_hotplug_lock down the dependency chain.
> 
> One way to do that is to introduce a new top level cpuset_top_mutex
> which will be acquired first.  This new cpuset_top_mutex will provide
> the need mutual exclusion without the need to hold cpus_read_lock().
> 

Introducing a new global lock warrants careful consideration. I wonder if we
could make all updates to isolated_cpus asynchronous. If that is feasible, we
could avoid adding a global lock altogether. If not, we need to clarify which
updates must remain synchronous and which ones can be handled asynchronously.
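
As a rough illustration of what "asynchronous" could mean here, below is a
minimal userspace sketch using pthreads (not kernel code). Writers only record
the desired mask and set a pending flag under a lock; a worker later takes a
snapshot and applies it with no lock held. All names (state_lock,
request_isolation, deferred_worker, applied_mask, apply_count) are
hypothetical, and a single worker is assumed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long isolated_mask;	/* desired state, under state_lock */
static bool update_pending;		/* under state_lock */

static unsigned long applied_mask;	/* written only by the single worker */
static int apply_count;			/* number of slow updates performed */

/* Fast path: record the request, never do the slow update here. */
static void request_isolation(unsigned long mask)
{
	pthread_mutex_lock(&state_lock);
	isolated_mask = mask;
	update_pending = true;
	pthread_mutex_unlock(&state_lock);
}

/* Deferred path: snapshot under the lock, apply with no lock held. */
static void deferred_worker(void)
{
	unsigned long snapshot;

	pthread_mutex_lock(&state_lock);
	if (!update_pending) {
		pthread_mutex_unlock(&state_lock);
		return;
	}
	snapshot = isolated_mask;
	update_pending = false;
	pthread_mutex_unlock(&state_lock);

	/* Stand-in for the slow update (e.g. flushing workqueues). */
	applied_mask = snapshot;
	apply_count++;
}
```

Back-to-back requests coalesce into one slow update, but the applied state
lags behind the requested state until the worker has run, so callers that need
the new mask to be in effect on return would still need a synchronous path.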

> As cpus_read_lock() is now no longer held when
> tmigr_isolated_exclude_cpumask() is called, it needs to acquire it
> directly.
> 
> The lockdep_is_cpuset_held() is also updated to check the new
> cpuset_top_mutex.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/cgroup/cpuset.c        | 101 +++++++++++++++++++++++-----------
>  kernel/sched/isolation.c      |   4 +-
>  kernel/time/timer_migration.c |   3 +-
>  3 files changed, 70 insertions(+), 38 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 0b0eb1df09d5..edccfa2df9da 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -78,13 +78,13 @@ static cpumask_var_t	subpartitions_cpus;
>  static cpumask_var_t	isolated_cpus;
>  
>  /*
> - * isolated_cpus updating flag (protected by cpuset_mutex)
> + * isolated_cpus updating flag (protected by cpuset_top_mutex)
>   * Set if isolated_cpus is going to be updated in the current
>   * cpuset_mutex crtical section.
>   */
>  static bool isolated_cpus_updating;
>  
> -/* Both cpuset_mutex and cpus_read_locked acquired */
> +/* cpuset_top_mutex acquired */
>  static bool cpuset_locked;
>  
>  /*
> @@ -222,29 +222,44 @@ struct cpuset top_cpuset = {
>  };
>  
>  /*
> - * There are two global locks guarding cpuset structures - cpuset_mutex and
> - * callback_lock. The cpuset code uses only cpuset_mutex. Other kernel
> - * subsystems can use cpuset_lock()/cpuset_unlock() to prevent change to cpuset
> - * structures. Note that cpuset_mutex needs to be a mutex as it is used in
> - * paths that rely on priority inheritance (e.g. scheduler - on RT) for
> - * correctness.
> + * CPUSET Locking Convention
> + * -------------------------
>   *
> - * A task must hold both locks to modify cpusets.  If a task holds
> - * cpuset_mutex, it blocks others, ensuring that it is the only task able to
> - * also acquire callback_lock and be able to modify cpusets.  It can perform
> - * various checks on the cpuset structure first, knowing nothing will change.
> - * It can also allocate memory while just holding cpuset_mutex.  While it is
> - * performing these checks, various callback routines can briefly acquire
> - * callback_lock to query cpusets.  Once it is ready to make the changes, it
> - * takes callback_lock, blocking everyone else.
> + * Below are the four global locks guarding cpuset structures in lock
> + * acquisition order:
> + *  - cpuset_top_mutex
> + *  - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
> + *  - cpuset_mutex
> + *  - callback_lock (raw spinlock)
>   *
> - * Calls to the kernel memory allocator can not be made while holding
> - * callback_lock, as that would risk double tripping on callback_lock
> - * from one of the callbacks into the cpuset code from within
> - * __alloc_pages().
> + * The first cpuset_top_mutex will be held except when calling into
> + * cpuset_handle_hotplug() from the CPU hotplug code where cpus_write_lock
> + * and cpuset_mutex will be held instead.
>   *
> - * If a task is only holding callback_lock, then it has read-only
> - * access to cpusets.
> + * As cpuset will now indirectly flush a number of different workqueues in
> + * housekeeping_update() when the set of isolated CPUs is going to be changed,
> + * it may not be safe from the circular locking perspective to hold the
> + * cpus_read_lock. So cpus_read_lock and cpuset_mutex will be released before
> + * calling housekeeping_update() and re-acquired afterward.
> + *
> + * A task must hold all the remaining three locks to modify externally visible
> + * or used fields of cpusets, though some of the internally used cpuset fields
> + * can be modified without holding callback_lock. If only reliable read access
> + * of the externally used fields are needed, a task can hold either
> + * cpuset_mutex or callback_lock which are exposed to other subsystems.
> + *
> + * If a task holds cpu_hotplug_lock and cpuset_mutex, it blocks others,
> + * ensuring that it is the only task able to also acquire callback_lock and
> + * be able to modify cpusets.  It can perform various checks on the cpuset
> + * structure first, knowing nothing will change. It can also allocate memory
> + * without holding callback_lock. While it is performing these checks, various
> + * callback routines can briefly acquire callback_lock to query cpusets.  Once
> + * it is ready to make the changes, it takes callback_lock, blocking everyone
> + * else.
> + *
> + * Calls to the kernel memory allocator cannot be made while holding
> + * callback_lock which is a spinlock, as the memory allocator may sleep or
> + * call back into cpuset code and acquire callback_lock.
>   *
>   * Now, the task_struct fields mems_allowed and mempolicy may be changed
>   * by other task, we use alloc_lock in the task_struct fields to protect
> @@ -255,6 +270,7 @@ struct cpuset top_cpuset = {
>   * cpumasks and nodemasks.
>   */
>  
> +static DEFINE_MUTEX(cpuset_top_mutex);
>  static DEFINE_MUTEX(cpuset_mutex);
>  
>  /**
> @@ -278,6 +294,18 @@ void lockdep_assert_cpuset_lock_held(void)
>  	lockdep_assert_held(&cpuset_mutex);
>  }
>  
> +static void cpuset_partial_lock(void)
> +{
> +	cpus_read_lock();
> +	mutex_lock(&cpuset_mutex);
> +}
> +
> +static void cpuset_partial_unlock(void)
> +{
> +	mutex_unlock(&cpuset_mutex);
> +	cpus_read_unlock();
> +}
> +
>  /**
>   * cpuset_full_lock - Acquire full protection for cpuset modification
>   *
> @@ -286,22 +314,22 @@ void lockdep_assert_cpuset_lock_held(void)
>   */
>  void cpuset_full_lock(void)
>  {
> -	cpus_read_lock();
> -	mutex_lock(&cpuset_mutex);
> +	mutex_lock(&cpuset_top_mutex);
> +	cpuset_partial_lock();
>  	cpuset_locked = true;
>  }
>  
>  void cpuset_full_unlock(void)
>  {
>  	cpuset_locked = false;
> -	mutex_unlock(&cpuset_mutex);
> -	cpus_read_unlock();
> +	cpuset_partial_unlock();
> +	mutex_unlock(&cpuset_top_mutex);
>  }
>  
>  #ifdef CONFIG_LOCKDEP
>  bool lockdep_is_cpuset_held(void)
>  {
> -	return lockdep_is_held(&cpuset_mutex);
> +	return lockdep_is_held(&cpuset_top_mutex);
>  }
>  #endif
>  
> @@ -1292,12 +1320,12 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
>  
>  static void isolcpus_workfn(struct work_struct *work)
>  {
> -	cpuset_full_lock();
> -	if (isolated_cpus_updating) {
> -		WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
> -		isolated_cpus_updating = false;
> -	}
> -	cpuset_full_unlock();
> +	guard(mutex)(&cpuset_top_mutex);
> +	if (!isolated_cpus_updating)
> +		return;
> +
> +	WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
> +	isolated_cpus_updating = false;
>  }
>  
>  /*
> @@ -1331,8 +1359,15 @@ static void update_isolation_cpumasks(void)
>  		return;
>  	}
>  
> +	lockdep_assert_held(&cpuset_top_mutex);
> +	/*
> +	 * Release cpus_read_lock & cpuset_mutex before calling
> +	 * housekeeping_update() and re-acquiring them afterward.
> +	 */
> +	cpuset_partial_unlock();
>  	WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
>  	isolated_cpus_updating = false;
> +	cpuset_partial_lock();
>  }
>  
>  /**
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 3b725d39c06e..ef152d401fe2 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -123,8 +123,6 @@ int housekeeping_update(struct cpumask *isol_mask)
>  	struct cpumask *trial, *old = NULL;
>  	int err;
>  
> -	lockdep_assert_cpus_held();
> -
>  	trial = kmalloc(cpumask_size(), GFP_KERNEL);
>  	if (!trial)
>  		return -ENOMEM;
> @@ -136,7 +134,7 @@ int housekeeping_update(struct cpumask *isol_mask)
>  	}
>  
>  	if (!housekeeping.flags)
> -		static_branch_enable_cpuslocked(&housekeeping_overridden);
> +		static_branch_enable(&housekeeping_overridden);
>  
>  	if (housekeeping.flags & HK_FLAG_DOMAIN)
>  		old = housekeeping_cpumask_dereference(HK_TYPE_DOMAIN);
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> index 6da9cd562b20..244a8d025e78 100644
> --- a/kernel/time/timer_migration.c
> +++ b/kernel/time/timer_migration.c
> @@ -1559,8 +1559,6 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
>  	cpumask_var_t cpumask __free(free_cpumask_var) = CPUMASK_VAR_NULL;
>  	int cpu;
>  
> -	lockdep_assert_cpus_held();
> -
>  	if (!works)
>  		return -ENOMEM;
>  	if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
> @@ -1570,6 +1568,7 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
>  	 * First set previously isolated CPUs as available (unisolate).
>  	 * This cpumask contains only CPUs that switched to available now.
>  	 */
> +	guard(cpus_read_lock)();
>  	cpumask_andnot(cpumask, cpu_online_mask, exclude_cpumask);
>  	cpumask_andnot(cpumask, cpumask, tmigr_available_cpumask);
>  

-- 
Best regards,
Ridong



* Re: [PATCH/for-next v2 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to workqueue
  2026-01-31  2:05       ` Chen Ridong
@ 2026-01-31 23:00         ` Waiman Long
  2026-02-02  0:58           ` Chen Ridong
  0 siblings, 1 reply; 23+ messages in thread
From: Waiman Long @ 2026-01-31 23:00 UTC (permalink / raw)
  To: Chen Ridong, Waiman Long, Tejun Heo, Johannes Weiner,
	Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker,
	Thomas Gleixner, Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest

On 1/30/26 9:05 PM, Chen Ridong wrote:
>
> On 2026/1/31 9:45, Waiman Long wrote:
>> On 1/30/26 7:58 PM, Chen Ridong wrote:
>>> On 2026/1/30 23:42, Waiman Long wrote:
>>>> The update_isolation_cpumasks() function can be called either directly
>>>> from regular cpuset control file write with cpuset_full_lock() called
>>>> or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held.
>>>>
>>>> As we are going to enable dynamic update to the nozh_full housekeeping
>>>> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
>>>> allowing the CPU hotplug path to call into housekeeping_update() directly
>>>> from update_isolation_cpumasks() will likely cause deadlock. So we
>>>> have to defer any call to housekeeping_update() after the CPU hotplug
>>>> operation has finished. This is now done via the workqueue where
>>>> the actual housekeeping_update() call, if needed, will happen after
>>>> cpus_write_lock is released.
>>>>
>>>> We can't use the synchronous task_work API as call from CPU hotplug
>>>> path happen in the per-cpu kthread of the CPU that is being shut down
>>>> or brought up. Because of the asynchronous nature of workqueue, the
>>>> HK_TYPE_DOMAIN housekeeping cpumask will be updated a bit later than the
>>>> "cpuset.cpus.isolated" control file in this case.
>>>>
>>>> Also add a check in test_cpuset_prs.sh and modify some existing
>>>> test cases to confirm that "cpuset.cpus.isolated" and HK_TYPE_DOMAIN
>>>> housekeeping cpumask will both be updated.
>>>>
>>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>>> ---
>>>>    kernel/cgroup/cpuset.c                        | 37 +++++++++++++++++--
>>>>    .../selftests/cgroup/test_cpuset_prs.sh       | 13 +++++--
>>>>    2 files changed, 44 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>>>> index 7b7d12ab1006..0b0eb1df09d5 100644
>>>> --- a/kernel/cgroup/cpuset.c
>>>> +++ b/kernel/cgroup/cpuset.c
>>>> @@ -84,6 +84,9 @@ static cpumask_var_t    isolated_cpus;
>>>>     */
>>>>    static bool isolated_cpus_updating;
>>>>    +/* Both cpuset_mutex and cpus_read_locked acquired */
>>>> +static bool cpuset_locked;
>>>> +
>>>>    /*
>>>>     * A flag to force sched domain rebuild at the end of an operation.
>>>>     * It can be set in
>>>> @@ -285,10 +288,12 @@ void cpuset_full_lock(void)
>>>>    {
>>>>        cpus_read_lock();
>>>>        mutex_lock(&cpuset_mutex);
>>>> +    cpuset_locked = true;
>>>>    }
>>>>      void cpuset_full_unlock(void)
>>>>    {
>>>> +    cpuset_locked = false;
>>>>        mutex_unlock(&cpuset_mutex);
>>>>        cpus_read_unlock();
>>>>    }
>>>> @@ -1285,6 +1290,16 @@ static bool prstate_housekeeping_conflict(int prstate,
>>>> struct cpumask *new_cpus)
>>>>        return false;
>>>>    }
>>>>    +static void isolcpus_workfn(struct work_struct *work)
>>>> +{
>>>> +    cpuset_full_lock();
>>>> +    if (isolated_cpus_updating) {
>>>> +        WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
>>>> +        isolated_cpus_updating = false;
>>>> +    }
>>>> +    cpuset_full_unlock();
>>>> +}
>>>> +
>>>>    /*
>>>>     * update_isolation_cpumasks - Update external isolation related CPU masks
>>>>     *
>>>> @@ -1293,14 +1308,30 @@ static bool prstate_housekeeping_conflict(int
>>>> prstate, struct cpumask *new_cpus)
>>>>     */
>>>>    static void update_isolation_cpumasks(void)
>>>>    {
>>>> -    int ret;
>>>> +    static DECLARE_WORK(isolcpus_work, isolcpus_workfn);
>>>>          if (!isolated_cpus_updating)
>>>>            return;
>>>>    -    ret = housekeeping_update(isolated_cpus);
>>>> -    WARN_ON_ONCE(ret < 0);
>>>> +    /*
>>>> +     * This function can be reached either directly from regular cpuset
>>>> +     * control file write (cpuset_locked) or via hotplug (cpus_write_lock
>>>> +     * && cpuset_mutex held). In the later case, we defer the
>>>> +     * housekeeping_update() call to the system_unbound_wq to avoid the
>>>> +     * possibility of deadlock. This also means that there will be a short
>>>> +     * period of time where HK_TYPE_DOMAIN housekeeping cpumask will lag
>>>> +     * behind isolated_cpus.
>>>> +     */
>>>> +    if (!cpuset_locked) {
>>> Adding a global variable makes this difficult to handle, especially in
>>> concurrent scenarios, since we could read it outside of a critical region.
>> No, cpuset_locked is always read from or written into inside a critical section.
>> It is under cpuset_mutex up to this point and then with the cpuset_top_mutex
>> with the next patch.
> This is somewhat confusing. cpuset_locked is only set to true when the "full
> lock" has been acquired. If cpuset_locked is false, that should mean we are
> outside of any critical region. Conversely, if we are inside a critical region,
> cpuset_locked should be true.
>
> The situation is a bit messy, it’s not clearly which lock protects which global
> variable.

There is a comment above "cpuset_locked" which states which lock protects
it. The locking situation is becoming more complicated. I think I will
add a new patch to document more clearly which lock protects each global
variable.

Cheers,
Longman

>
>>> I suggest removing cpuset_locked and adding async_update_isolation_cpumasks
>>> instead, which can indicate to the caller it should call without holding the
>>> full lock.
>> The point of this global variable is to distinguish between calling from CPU
>> hotplug and the other regular cpuset code paths. The only difference between
>> these two are having cpus_read_lock or cpus_write_lock held. That is why I think
>> adding a global variable in cpuset_full_lock() is the easy way. Otherwise, we
>> will to add extra argument to some of the functions to distinguish these two cases.
>>
>> Cheers,
>> Longman
>>



* Re: [PATCH/for-next v2 2/2] cgroup/cpuset: Introduce a new top level cpuset_top_mutex
  2026-01-31  2:53   ` Chen Ridong
@ 2026-01-31 23:13     ` Waiman Long
  2026-02-02  1:11       ` Chen Ridong
  0 siblings, 1 reply; 23+ messages in thread
From: Waiman Long @ 2026-01-31 23:13 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest


On 1/30/26 9:53 PM, Chen Ridong wrote:
>
> On 2026/1/30 23:42, Waiman Long wrote:
>> The current cpuset partition code is able to dynamically update
>> the sched domains of a running system and the corresponding
>> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentally the
>> "isolcpus=domain,..." boot command line feature at run time.
>>
>> The housekeeping cpumask update requires flushing a number of different
>> workqueues which may not be safe with cpus_read_lock() held as the
>> workqueue flushing code may acquire cpus_read_lock() or acquiring locks
>> which have locking dependency with cpus_read_lock() down the chain. Below
>> is an example of such circular locking problem.
>>
>>    ======================================================
>>    WARNING: possible circular locking dependency detected
>>    6.18.0-test+ #2 Tainted: G S
>>    ------------------------------------------------------
>>    test_cpuset_prs/10971 is trying to acquire lock:
>>    ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180
>>
>>    but task is already holding lock:
>>    ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
>>
>>    which lock already depends on the new lock.
>>
>>    the existing dependency chain (in reverse order) is:
>>    -> #4 (cpuset_mutex){+.+.}-{4:4}:
>>    -> #3 (cpu_hotplug_lock){++++}-{0:0}:
>>    -> #2 (rtnl_mutex){+.+.}-{4:4}:
>>    -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
>>    -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
>>
>>    Chain exists of:
>>      (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex
>>
>>    5 locks held by test_cpuset_prs/10971:
>>     #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
>>     #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
>>     #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
>>     #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
>>     #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
>>
>>    Call Trace:
>>     <TASK>
>>       :
>>     touch_wq_lockdep_map+0x93/0x180
>>     __flush_workqueue+0x111/0x10b0
>>     housekeeping_update+0x12d/0x2d0
>>     update_parent_effective_cpumask+0x595/0x2440
>>     update_prstate+0x89d/0xce0
>>     cpuset_partition_write+0xc5/0x130
>>     cgroup_file_write+0x1a5/0x680
>>     kernfs_fop_write_iter+0x3df/0x5f0
>>     vfs_write+0x525/0xfd0
>>     ksys_write+0xf9/0x1d0
>>     do_syscall_64+0x95/0x520
>>     entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> To avoid such a circular locking dependency problem, we have to
>> call housekeeping_update() without holding the cpus_read_lock() and
>> cpuset_mutex. The current set of wq's flushed by housekeeping_update()
>> may not have work functions that call cpus_read_lock() directly,
>> but we are likely to extend the list of wq's that are flushed in the
>> future. Moreover, the current set of work functions may hold locks that
>> may have cpu_hotplug_lock down the dependency chain.
>>
>> One way to do that is to introduce a new top level cpuset_top_mutex
>> which will be acquired first.  This new cpuset_top_mutex will provide
>> the need mutual exclusion without the need to hold cpus_read_lock().
>>
> Introducing a new global lock warrants careful consideration. I wonder if we
> could make all updates to isolated_cpus asynchronous. If that is feasible, we
> could avoid adding a global lock altogether. If not, we need to clarify which
> updates must remain synchronous and which ones can be handled asynchronously.

Almost all of the cpuset code runs with cpuset_mutex held, together with
either cpus_read_lock or cpus_write_lock, so there is no concurrent
access/update to any of the cpuset internal data. The new
cpuset_top_mutex is added to resolve the possible deadlock scenarios with
the new housekeeping_update() call without breaking this model. Allowing
parallel concurrent access/update to cpuset data would greatly complicate
the code, and we would likely miss some corner cases that we would have to
fix in the future. We would only do that if cpuset were in a critical
performance path, but it is not. It is not just isolated_cpus that we
are protecting; all the other cpuset data may be at risk if we don't
have another top level mutex to protect them.
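
The locking model described above can be sketched in userspace with pthreads
(this is an illustration, not the kernel code). The outer mutex stays held for
the whole operation, while the inner pair is dropped around the flush-like
call and retaken afterward. The three locks model cpuset_top_mutex,
cpu_hotplug_lock and cpuset_mutex; cpu_hotplug_lock is simplified to a plain
mutex here, and every function name plus the inner_held and flushes_done
variables are hypothetical. The sketch assumes a single thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t top_mutex = PTHREAD_MUTEX_INITIALIZER;	  /* cpuset_top_mutex */
static pthread_mutex_t hotplug_lock = PTHREAD_MUTEX_INITIALIZER; /* cpu_hotplug_lock */
static pthread_mutex_t inner_mutex = PTHREAD_MUTEX_INITIALIZER;  /* cpuset_mutex */

static bool inner_held;		/* tracks the inner pair in this model */
static int flushes_done;

/* Take the inner pair in hierarchy order. */
static void partial_lock(void)
{
	pthread_mutex_lock(&hotplug_lock);
	pthread_mutex_lock(&inner_mutex);
	inner_held = true;
}

/* Release the inner pair in reverse order. */
static void partial_unlock(void)
{
	inner_held = false;
	pthread_mutex_unlock(&inner_mutex);
	pthread_mutex_unlock(&hotplug_lock);
}

/* Outermost lock first, then the inner pair. */
static void full_lock(void)
{
	pthread_mutex_lock(&top_mutex);
	partial_lock();
}

static void full_unlock(void)
{
	partial_unlock();
	pthread_mutex_unlock(&top_mutex);
}

/* Stand-in for housekeeping_update(): must not run under the inner pair. */
static void flush_like_operation(void)
{
	assert(!inner_held);
	flushes_done++;
}

/* Drop the inner pair around the flush; top_mutex keeps others out. */
static void update_isolation(void)
{
	assert(inner_held);
	partial_unlock();
	flush_like_operation();
	partial_lock();
}
```

Because top_mutex is never released in update_isolation(), other full_lock()
callers stay excluded during the window in which the inner pair is dropped.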

Cheers,
Longman



* Re: [PATCH/for-next v2 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to workqueue
  2026-01-31 23:00         ` Waiman Long
@ 2026-02-02  0:58           ` Chen Ridong
  0 siblings, 0 replies; 23+ messages in thread
From: Chen Ridong @ 2026-02-02  0:58 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest



On 2026/2/1 7:00, Waiman Long wrote:
> On 1/30/26 9:05 PM, Chen Ridong wrote:
>>
>> On 2026/1/31 9:45, Waiman Long wrote:
>>> On 1/30/26 7:58 PM, Chen Ridong wrote:
>>>> On 2026/1/30 23:42, Waiman Long wrote:
>>>>> The update_isolation_cpumasks() function can be called either directly
>>>>> from regular cpuset control file write with cpuset_full_lock() called
>>>>> or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held.
>>>>>
>>>>> As we are going to enable dynamic update to the nozh_full housekeeping
>>>>> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
>>>>> allowing the CPU hotplug path to call into housekeeping_update() directly
>>>>> from update_isolation_cpumasks() will likely cause deadlock. So we
>>>>> have to defer any call to housekeeping_update() after the CPU hotplug
>>>>> operation has finished. This is now done via the workqueue where
>>>>> the actual housekeeping_update() call, if needed, will happen after
>>>>> cpus_write_lock is released.
>>>>>
>>>>> We can't use the synchronous task_work API as call from CPU hotplug
>>>>> path happen in the per-cpu kthread of the CPU that is being shut down
>>>>> or brought up. Because of the asynchronous nature of workqueue, the
>>>>> HK_TYPE_DOMAIN housekeeping cpumask will be updated a bit later than the
>>>>> "cpuset.cpus.isolated" control file in this case.
>>>>>
>>>>> Also add a check in test_cpuset_prs.sh and modify some existing
>>>>> test cases to confirm that "cpuset.cpus.isolated" and HK_TYPE_DOMAIN
>>>>> housekeeping cpumask will both be updated.
>>>>>
>>>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>>>> ---
>>>>>    kernel/cgroup/cpuset.c                        | 37 +++++++++++++++++--
>>>>>    .../selftests/cgroup/test_cpuset_prs.sh       | 13 +++++--
>>>>>    2 files changed, 44 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>>>>> index 7b7d12ab1006..0b0eb1df09d5 100644
>>>>> --- a/kernel/cgroup/cpuset.c
>>>>> +++ b/kernel/cgroup/cpuset.c
>>>>> @@ -84,6 +84,9 @@ static cpumask_var_t    isolated_cpus;
>>>>>     */
>>>>>    static bool isolated_cpus_updating;
>>>>>    +/* Both cpuset_mutex and cpus_read_locked acquired */
>>>>> +static bool cpuset_locked;
>>>>> +
>>>>>    /*
>>>>>     * A flag to force sched domain rebuild at the end of an operation.
>>>>>     * It can be set in
>>>>> @@ -285,10 +288,12 @@ void cpuset_full_lock(void)
>>>>>    {
>>>>>        cpus_read_lock();
>>>>>        mutex_lock(&cpuset_mutex);
>>>>> +    cpuset_locked = true;
>>>>>    }
>>>>>      void cpuset_full_unlock(void)
>>>>>    {
>>>>> +    cpuset_locked = false;
>>>>>        mutex_unlock(&cpuset_mutex);
>>>>>        cpus_read_unlock();
>>>>>    }
>>>>> @@ -1285,6 +1290,16 @@ static bool prstate_housekeeping_conflict(int prstate,
>>>>> struct cpumask *new_cpus)
>>>>>        return false;
>>>>>    }
>>>>>    +static void isolcpus_workfn(struct work_struct *work)
>>>>> +{
>>>>> +    cpuset_full_lock();
>>>>> +    if (isolated_cpus_updating) {
>>>>> +        WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
>>>>> +        isolated_cpus_updating = false;
>>>>> +    }
>>>>> +    cpuset_full_unlock();
>>>>> +}
>>>>> +
>>>>>    /*
>>>>>     * update_isolation_cpumasks - Update external isolation related CPU masks
>>>>>     *
>>>>> @@ -1293,14 +1308,30 @@ static bool prstate_housekeeping_conflict(int
>>>>> prstate, struct cpumask *new_cpus)
>>>>>     */
>>>>>    static void update_isolation_cpumasks(void)
>>>>>    {
>>>>> -    int ret;
>>>>> +    static DECLARE_WORK(isolcpus_work, isolcpus_workfn);
>>>>>          if (!isolated_cpus_updating)
>>>>>            return;
>>>>>    -    ret = housekeeping_update(isolated_cpus);
>>>>> -    WARN_ON_ONCE(ret < 0);
>>>>> +    /*
>>>>> +     * This function can be reached either directly from regular cpuset
>>>>> +     * control file write (cpuset_locked) or via hotplug (cpus_write_lock
>>>>> +     * && cpuset_mutex held). In the later case, we defer the
>>>>> +     * housekeeping_update() call to the system_unbound_wq to avoid the
>>>>> +     * possibility of deadlock. This also means that there will be a short
>>>>> +     * period of time where HK_TYPE_DOMAIN housekeeping cpumask will lag
>>>>> +     * behind isolated_cpus.
>>>>> +     */
>>>>> +    if (!cpuset_locked) {
>>>> Adding a global variable makes this difficult to handle, especially in
>>>> concurrent scenarios, since we could read it outside of a critical region.
>>> No, cpuset_locked is always read from or written into inside a critical section.
>>> It is under cpuset_mutex up to this point and then with the cpuset_top_mutex
>>> with the next patch.
>> This is somewhat confusing. cpuset_locked is only set to true when the "full
>> lock" has been acquired. If cpuset_locked is false, that should mean we are
>> outside of any critical region. Conversely, if we are inside a critical region,
>> cpuset_locked should be true.
>>
>> The situation is a bit messy, it’s not clearly which lock protects which global
>> variable.
> 
> There is a comment above "cpuset_locked" which state which lock protect it. The
> locking situation is becoming more complicated. I think I will add a new patch
> to more clearly document what each global variable is being protected by.
> 

Yes, we need that.

> 
>>
>>>> I suggest removing cpuset_locked and adding async_update_isolation_cpumasks
>>>> instead, which can indicate to the caller it should call without holding the
>>>> full lock.
>>> The point of this global variable is to distinguish between calling from CPU
>>> hotplug and the other regular cpuset code paths. The only difference between
>>> these two are having cpus_read_lock or cpus_write_lock held. That is why I think
>>> adding a global variable in cpuset_full_lock() is the easy way. Otherwise, we
>>> will to add extra argument to some of the functions to distinguish these two
>>> cases.
>>>
>>> Cheers,
>>> Longman
>>>
> 

-- 
Best regards,
Ridong



* Re: [PATCH/for-next v2 2/2] cgroup/cpuset: Introduce a new top level cpuset_top_mutex
  2026-01-31 23:13     ` Waiman Long
@ 2026-02-02  1:11       ` Chen Ridong
  2026-02-02 18:29         ` Waiman Long
  0 siblings, 1 reply; 23+ messages in thread
From: Chen Ridong @ 2026-02-02  1:11 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest



On 2026/2/1 7:13, Waiman Long wrote:
> 
> On 1/30/26 9:53 PM, Chen Ridong wrote:
>>
>> On 2026/1/30 23:42, Waiman Long wrote:
>>> The current cpuset partition code is able to dynamically update
>>> the sched domains of a running system and the corresponding
>>> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentally the
>>> "isolcpus=domain,..." boot command line feature at run time.
>>>
>>> The housekeeping cpumask update requires flushing a number of different
>>> workqueues which may not be safe with cpus_read_lock() held as the
>>> workqueue flushing code may acquire cpus_read_lock() or acquiring locks
>>> which have locking dependency with cpus_read_lock() down the chain. Below
>>> is an example of such circular locking problem.
>>>
>>>    ======================================================
>>>    WARNING: possible circular locking dependency detected
>>>    6.18.0-test+ #2 Tainted: G S
>>>    ------------------------------------------------------
>>>    test_cpuset_prs/10971 is trying to acquire lock:
>>>    ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at:
>>> touch_wq_lockdep_map+0x7a/0x180
>>>
>>>    but task is already holding lock:
>>>    ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at:
>>> cpuset_partition_write+0x85/0x130
>>>
>>>    which lock already depends on the new lock.
>>>
>>>    the existing dependency chain (in reverse order) is:
>>>    -> #4 (cpuset_mutex){+.+.}-{4:4}:
>>>    -> #3 (cpu_hotplug_lock){++++}-{0:0}:
>>>    -> #2 (rtnl_mutex){+.+.}-{4:4}:
>>>    -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
>>>    -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
>>>
>>>    Chain exists of:
>>>      (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex
>>>
>>>    5 locks held by test_cpuset_prs/10971:
>>>     #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
>>>     #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at:
>>> kernfs_fop_write_iter+0x260/0x5f0
>>>     #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at:
>>> kernfs_fop_write_iter+0x2b6/0x5f0
>>>     #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at:
>>> cpuset_partition_write+0x77/0x130
>>>     #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at:
>>> cpuset_partition_write+0x85/0x130
>>>
>>>    Call Trace:
>>>     <TASK>
>>>       :
>>>     touch_wq_lockdep_map+0x93/0x180
>>>     __flush_workqueue+0x111/0x10b0
>>>     housekeeping_update+0x12d/0x2d0
>>>     update_parent_effective_cpumask+0x595/0x2440
>>>     update_prstate+0x89d/0xce0
>>>     cpuset_partition_write+0xc5/0x130
>>>     cgroup_file_write+0x1a5/0x680
>>>     kernfs_fop_write_iter+0x3df/0x5f0
>>>     vfs_write+0x525/0xfd0
>>>     ksys_write+0xf9/0x1d0
>>>     do_syscall_64+0x95/0x520
>>>     entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>
>>> To avoid such a circular locking dependency problem, we have to
>>> call housekeeping_update() without holding the cpus_read_lock() and
>>> cpuset_mutex. The current set of wq's flushed by housekeeping_update()
>>> may not have work functions that call cpus_read_lock() directly,
>>> but we are likely to extend the list of wq's that are flushed in the
>>> future. Moreover, the current set of work functions may hold locks that
>>> may have cpu_hotplug_lock down the dependency chain.
>>>
>>> One way to do that is to introduce a new top level cpuset_top_mutex
>>> which will be acquired first.  This new cpuset_top_mutex will provide
>>> the needed mutual exclusion without the need to hold cpus_read_lock().
>>>
>> Introducing a new global lock warrants careful consideration. I wonder if we
>> could make all updates to isolated_cpus asynchronous. If that is feasible, we
>> could avoid adding a global lock altogether. If not, we need to clarify which
>> updates must remain synchronous and which ones can be handled asynchronously.
> 
> Almost all the cpuset code is run with cpuset_mutex held with either
> cpus_read_lock or cpus_write_lock. So there is no concurrent access/update to
> any of the cpuset internal data. The new cpuset_top_mutex is added to resolve the
> possible deadlock scenarios with the new housekeeping_update() call without
> breaking this model. Allowing parallel concurrent access/update to cpuset data will
> greatly complicate the code and we will likely miss some corner cases that we

I agree with that point. However, we already have paths where isolated_cpus is
updated asynchronously, meaning parallel concurrent access/update is already
happening. Therefore, we cannot entirely avoid such scenarios, so why not keep
the locking simple (make all updates to isolated_cpus asynchronous)?

This is just a thought in my mind.

> have to fix in the future. We will only do that if cpuset is in a critical
> performance path, but it is not. It is not just isolated_cpus that we are
> protecting, all the other cpuset data may be at risk if we don't have another
> top level mutex to protect them.
> 
> Cheers,
> Longman
> 

-- 
Best regards,
Ridong


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH/for-next v2 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to workqueue
  2026-01-30 15:42 ` [PATCH/for-next v2 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to workqueue Waiman Long
  2026-01-31  0:47   ` Chen Ridong
  2026-01-31  0:58   ` Chen Ridong
@ 2026-02-02 13:05   ` Peter Zijlstra
  2026-02-02 18:21     ` Waiman Long
  2 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2026-02-02 13:05 UTC (permalink / raw)
  To: Waiman Long
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Anna-Maria Behnsen,
	Frederic Weisbecker, Thomas Gleixner, Shuah Khan, cgroups,
	linux-kernel, linux-kselftest

On Fri, Jan 30, 2026 at 10:42:53AM -0500, Waiman Long wrote:

> +/* Both cpuset_mutex and cpus_read_locked acquired */
> +static bool cpuset_locked;
> +
>  /*
>   * A flag to force sched domain rebuild at the end of an operation.
>   * It can be set in
> @@ -285,10 +288,12 @@ void cpuset_full_lock(void)
>  {
>  	cpus_read_lock();
>  	mutex_lock(&cpuset_mutex);
> +	cpuset_locked = true;
>  }
>  
>  void cpuset_full_unlock(void)
>  {
> +	cpuset_locked = false;
>  	mutex_unlock(&cpuset_mutex);
>  	cpus_read_unlock();
>  }

> @@ -1293,14 +1308,30 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
>   */
>  static void update_isolation_cpumasks(void)
>  {
> -	int ret;
> +	static DECLARE_WORK(isolcpus_work, isolcpus_workfn);
>  
>  	if (!isolated_cpus_updating)
>  		return;
>  
> -	ret = housekeeping_update(isolated_cpus);
> -	WARN_ON_ONCE(ret < 0);
> +	/*
> +	 * This function can be reached either directly from regular cpuset
> +	 * control file write (cpuset_locked) or via hotplug (cpus_write_lock
> +	 * && cpuset_mutex held). In the later case, we defer the
> +	 * housekeeping_update() call to the system_unbound_wq to avoid the
> +	 * possibility of deadlock. This also means that there will be a short
> +	 * period of time where HK_TYPE_DOMAIN housekeeping cpumask will lag
> +	 * behind isolated_cpus.
> +	 */
> +	if (!cpuset_locked) {

I agree with Chen that this is bloody terrible.

At the very least this should have:

	lockdep_assert_held(&cpuset_mutex);

But ideally you'd do patches against this and tip/locking/core that add
proper __guarded_by() annotations to this.
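
For readers without a lockdep-enabled kernel at hand, the idea behind
lockdep_assert_held() can be approximated in userspace. The following is
only a minimal sketch under stated assumptions: struct tracked_mutex,
tracked_lock(), tracked_unlock(), tracked_held_by_me() and
update_isolated_model() are hypothetical names invented here, not kernel
APIs, and the owner bookkeeping is plain (non-atomic), so it is only a
debugging aid for the thread that actually owns the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace wrapper: a mutex that remembers its owner. */
struct tracked_mutex {
	pthread_mutex_t lock;
	pthread_t owner;
	bool held;
};

#define TRACKED_MUTEX_INIT { .lock = PTHREAD_MUTEX_INITIALIZER, .held = false }

static void tracked_lock(struct tracked_mutex *m)
{
	pthread_mutex_lock(&m->lock);
	m->owner = pthread_self();
	m->held = true;
}

static void tracked_unlock(struct tracked_mutex *m)
{
	m->held = false;
	pthread_mutex_unlock(&m->lock);
}

/* Rough analogue of the question lockdep_assert_held() asks. */
static bool tracked_held_by_me(const struct tracked_mutex *m)
{
	return m->held && pthread_equal(m->owner, pthread_self());
}

/* Stand-ins for cpuset_mutex and the state it protects. */
static struct tracked_mutex cpuset_mutex_model = TRACKED_MUTEX_INIT;
static unsigned long isolated_cpus_model;

/* Refuses to touch the protected state unless the caller holds the lock. */
static bool update_isolated_model(unsigned long new_mask)
{
	if (!tracked_held_by_me(&cpuset_mutex_model))
		return false;
	isolated_cpus_model = new_mask;
	return true;
}
```

The point of such a check is that the locking requirement is verified at
the place where the data is touched, instead of being inferred from a
separate flag such as cpuset_locked.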

> +		/*
> +		 * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work
> +		 * item that is still pending.
> +		 */
> +		queue_work(system_unbound_wq, &isolcpus_work);
> +		return;
> +	}
>  
> +	WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
>  	isolated_cpus_updating = false;
>  }

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH/for-next v2 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to workqueue
  2026-02-02 13:05   ` Peter Zijlstra
@ 2026-02-02 18:21     ` Waiman Long
  2026-02-02 20:04       ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Waiman Long @ 2026-02-02 18:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Anna-Maria Behnsen,
	Frederic Weisbecker, Thomas Gleixner, Shuah Khan, cgroups,
	linux-kernel, linux-kselftest

On 2/2/26 8:05 AM, Peter Zijlstra wrote:
> On Fri, Jan 30, 2026 at 10:42:53AM -0500, Waiman Long wrote:
>
>> +/* Both cpuset_mutex and cpus_read_locked acquired */
>> +static bool cpuset_locked;
>> +
>>   /*
>>    * A flag to force sched domain rebuild at the end of an operation.
>>    * It can be set in
>> @@ -285,10 +288,12 @@ void cpuset_full_lock(void)
>>   {
>>   	cpus_read_lock();
>>   	mutex_lock(&cpuset_mutex);
>> +	cpuset_locked = true;
>>   }
>>   
>>   void cpuset_full_unlock(void)
>>   {
>> +	cpuset_locked = false;
>>   	mutex_unlock(&cpuset_mutex);
>>   	cpus_read_unlock();
>>   }
>> @@ -1293,14 +1308,30 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
>>    */
>>   static void update_isolation_cpumasks(void)
>>   {
>> -	int ret;
>> +	static DECLARE_WORK(isolcpus_work, isolcpus_workfn);
>>   
>>   	if (!isolated_cpus_updating)
>>   		return;
>>   
>> -	ret = housekeeping_update(isolated_cpus);
>> -	WARN_ON_ONCE(ret < 0);
>> +	/*
>> +	 * This function can be reached either directly from regular cpuset
>> +	 * control file write (cpuset_locked) or via hotplug (cpus_write_lock
>> +	 * && cpuset_mutex held). In the later case, we defer the
>> +	 * housekeeping_update() call to the system_unbound_wq to avoid the
>> +	 * possibility of deadlock. This also means that there will be a short
>> +	 * period of time where HK_TYPE_DOMAIN housekeeping cpumask will lag
>> +	 * behind isolated_cpus.
>> +	 */
>> +	if (!cpuset_locked) {
> I agree with Chen that this is bloody terrible.
>
> At the very least this should have:
>
> 	lockdep_assert_held(&cpuset_mutex);
>
> But ideally you'd do patches against this and tip/locking/core that add
> proper __guarded_by() annotations to this.

Yes, I am going to remove cpuset_locked in the next version. As for the
__guarded_by() annotation, I need to set up a clang environment that I
can use to test it before I work on that. I usually just use gcc
for my compilation needs.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH/for-next v2 2/2] cgroup/cpuset: Introduce a new top level cpuset_top_mutex
  2026-02-02  1:11       ` Chen Ridong
@ 2026-02-02 18:29         ` Waiman Long
  2026-02-04  1:55           ` Chen Ridong
  0 siblings, 1 reply; 23+ messages in thread
From: Waiman Long @ 2026-02-02 18:29 UTC (permalink / raw)
  To: Chen Ridong, Waiman Long, Tejun Heo, Johannes Weiner,
	Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker,
	Thomas Gleixner, Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest

On 2/1/26 8:11 PM, Chen Ridong wrote:
>
> On 2026/2/1 7:13, Waiman Long wrote:
>> On 1/30/26 9:53 PM, Chen Ridong wrote:
>>> On 2026/1/30 23:42, Waiman Long wrote:
>>>> The current cpuset partition code is able to dynamically update
>>>> the sched domains of a running system and the corresponding
>>>> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
>>>> "isolcpus=domain,..." boot command line feature at run time.
>>>>
>>>> The housekeeping cpumask update requires flushing a number of different
>>>> workqueues which may not be safe with cpus_read_lock() held as the
>>>> workqueue flushing code may acquire cpus_read_lock() or acquire locks
>>>> which have a locking dependency on cpus_read_lock() down the chain. Below
>>>> is an example of such a circular locking problem.
>>>>
>>>>     ======================================================
>>>>     WARNING: possible circular locking dependency detected
>>>>     6.18.0-test+ #2 Tainted: G S
>>>>     ------------------------------------------------------
>>>>     test_cpuset_prs/10971 is trying to acquire lock:
>>>>     ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at:
>>>> touch_wq_lockdep_map+0x7a/0x180
>>>>
>>>>     but task is already holding lock:
>>>>     ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at:
>>>> cpuset_partition_write+0x85/0x130
>>>>
>>>>     which lock already depends on the new lock.
>>>>
>>>>     the existing dependency chain (in reverse order) is:
>>>>     -> #4 (cpuset_mutex){+.+.}-{4:4}:
>>>>     -> #3 (cpu_hotplug_lock){++++}-{0:0}:
>>>>     -> #2 (rtnl_mutex){+.+.}-{4:4}:
>>>>     -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
>>>>     -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
>>>>
>>>>     Chain exists of:
>>>>       (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex
>>>>
>>>>     5 locks held by test_cpuset_prs/10971:
>>>>      #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
>>>>      #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at:
>>>> kernfs_fop_write_iter+0x260/0x5f0
>>>>      #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at:
>>>> kernfs_fop_write_iter+0x2b6/0x5f0
>>>>      #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at:
>>>> cpuset_partition_write+0x77/0x130
>>>>      #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at:
>>>> cpuset_partition_write+0x85/0x130
>>>>
>>>>     Call Trace:
>>>>      <TASK>
>>>>        :
>>>>      touch_wq_lockdep_map+0x93/0x180
>>>>      __flush_workqueue+0x111/0x10b0
>>>>      housekeeping_update+0x12d/0x2d0
>>>>      update_parent_effective_cpumask+0x595/0x2440
>>>>      update_prstate+0x89d/0xce0
>>>>      cpuset_partition_write+0xc5/0x130
>>>>      cgroup_file_write+0x1a5/0x680
>>>>      kernfs_fop_write_iter+0x3df/0x5f0
>>>>      vfs_write+0x525/0xfd0
>>>>      ksys_write+0xf9/0x1d0
>>>>      do_syscall_64+0x95/0x520
>>>>      entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>>
>>>> To avoid such a circular locking dependency problem, we have to
>>>> call housekeeping_update() without holding the cpus_read_lock() and
>>>> cpuset_mutex. The current set of wq's flushed by housekeeping_update()
>>>> may not have work functions that call cpus_read_lock() directly,
>>>> but we are likely to extend the list of wq's that are flushed in the
>>>> future. Moreover, the current set of work functions may hold locks that
>>>> may have cpu_hotplug_lock down the dependency chain.
>>>>
>>>> One way to do that is to introduce a new top level cpuset_top_mutex
>>>> which will be acquired first.  This new cpuset_top_mutex will provide
>>>> the needed mutual exclusion without the need to hold cpus_read_lock().
>>>>
>>> Introducing a new global lock warrants careful consideration. I wonder if we
>>> could make all updates to isolated_cpus asynchronous. If that is feasible, we
>>> could avoid adding a global lock altogether. If not, we need to clarify which
>>> updates must remain synchronous and which ones can be handled asynchronously.
>> Almost all the cpuset code is run with cpuset_mutex held with either
>> cpus_read_lock or cpus_write_lock. So there is no concurrent access/update to
>> any of the cpuset internal data. The new cpuset_top_mutex is added to resolve the
>> possible deadlock scenarios with the new housekeeping_update() call without
>> breaking this model. Allowing parallel concurrent access/update to cpuset data will
>> greatly complicate the code and we will likely miss some corner cases that we
> I agree with that point. However, we already have paths where isolated_cpus is
> updated asynchronously, meaning parallel concurrent access/update is already
> happening. Therefore, we cannot entirely avoid such scenarios, so why not keep
> the locking simple(make all updates to isolated_cpus asynchronous)?

isolated_cpus should only be updated in isolated_cpus_update() where
both cpuset_mutex and callback_lock are held. It can be read
asynchronously if either cpuset_mutex or callback_lock is held. Can you
show me the places where this rule isn't followed?
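
The rule above (writers hold both locks, readers hold either one) can be
illustrated with a small userspace model. This is only a sketch under
assumptions: outer_mutex and inner_lock stand in for cpuset_mutex and
callback_lock (the latter modeled as a plain pthread mutex here), and
isolated_mask, isolated_update_model(), isolated_read_outer() and
isolated_read_inner() are hypothetical names, not the cpuset code.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t outer_mutex = PTHREAD_MUTEX_INITIALIZER; /* models cpuset_mutex */
static pthread_mutex_t inner_lock = PTHREAD_MUTEX_INITIALIZER;  /* models callback_lock */
static unsigned long isolated_mask;                             /* models isolated_cpus */

/* Writer: takes both locks, always outer first, so lock order is fixed. */
static void isolated_update_model(unsigned long mask)
{
	pthread_mutex_lock(&outer_mutex);
	pthread_mutex_lock(&inner_lock);
	isolated_mask = mask;
	pthread_mutex_unlock(&inner_lock);
	pthread_mutex_unlock(&outer_mutex);
}

/* Reader holding only the outer lock: no writer can be inside its update. */
static unsigned long isolated_read_outer(void)
{
	unsigned long v;

	pthread_mutex_lock(&outer_mutex);
	v = isolated_mask;
	pthread_mutex_unlock(&outer_mutex);
	return v;
}

/* Reader holding only the inner lock: equally safe, for the same reason. */
static unsigned long isolated_read_inner(void)
{
	unsigned long v;

	pthread_mutex_lock(&inner_lock);
	v = isolated_mask;
	pthread_mutex_unlock(&inner_lock);
	return v;
}
```

Because a writer needs both locks, holding either one is enough to
exclude every writer, which is why either lock suffices for a stable
read.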

Cheers,
Longman


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH/for-next v2 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to workqueue
  2026-02-02 18:21     ` Waiman Long
@ 2026-02-02 20:04       ` Peter Zijlstra
  2026-02-02 20:06         ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2026-02-02 20:04 UTC (permalink / raw)
  To: Waiman Long
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Anna-Maria Behnsen,
	Frederic Weisbecker, Thomas Gleixner, Shuah Khan, cgroups,
	linux-kernel, linux-kselftest

On Mon, Feb 02, 2026 at 01:21:43PM -0500, Waiman Long wrote:

> Yes, I am going to remove cpuset_locked in the next version. As for
> __guarded_by() annotation, I need to set up a clang environment that I can
> use to test it before I will work on that. I usually just use gcc for my
> compilation need.

Debian experimental has clang-22, but there is also:

  https://github.com/llvm/llvm-project/releases/tag/llvmorg-22.1.0-rc2

See: Documentation/kbuild/llvm.rst


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH/for-next v2 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to workqueue
  2026-02-02 20:04       ` Peter Zijlstra
@ 2026-02-02 20:06         ` Peter Zijlstra
  2026-02-03  0:59           ` Waiman Long
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2026-02-02 20:06 UTC (permalink / raw)
  To: Waiman Long
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Anna-Maria Behnsen,
	Frederic Weisbecker, Thomas Gleixner, Shuah Khan, cgroups,
	linux-kernel, linux-kselftest

On Mon, Feb 02, 2026 at 09:04:57PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 02, 2026 at 01:21:43PM -0500, Waiman Long wrote:
> 
> > Yes, I am going to remove cpuset_locked in the next version. As for
> > __guarded_by() annotation, I need to set up a clang environment that I can
> > use to test it before I will work on that. I usually just use gcc for my
> > compilation need.
> 
> Debian experimental has clang-22, but there is also:
> 
>   https://github.com/llvm/llvm-project/releases/tag/llvmorg-22.1.0-rc2

Damn, copied wrong link:

  https://www.kernel.org/pub/tools/llvm/files/llvm-22.1.0-rc2-x86_64.tar.xz

> See: Documentation/kbuild/llvm.rst
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH/for-next v2 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to workqueue
  2026-02-02 20:06         ` Peter Zijlstra
@ 2026-02-03  0:59           ` Waiman Long
  0 siblings, 0 replies; 23+ messages in thread
From: Waiman Long @ 2026-02-03  0:59 UTC (permalink / raw)
  To: Peter Zijlstra, Waiman Long
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Anna-Maria Behnsen,
	Frederic Weisbecker, Thomas Gleixner, Shuah Khan, cgroups,
	linux-kernel, linux-kselftest

On 2/2/26 3:06 PM, Peter Zijlstra wrote:
> On Mon, Feb 02, 2026 at 09:04:57PM +0100, Peter Zijlstra wrote:
>> On Mon, Feb 02, 2026 at 01:21:43PM -0500, Waiman Long wrote:
>>
>>> Yes, I am going to remove cpuset_locked in the next version. As for
>>> __guarded_by() annotation, I need to set up a clang environment that I can
>>> use to test it before I will work on that. I usually just use gcc for my
>>> compilation need.
>> Debian experimental has clang-22, but there is also:
>>
>>    https://github.com/llvm/llvm-project/releases/tag/llvmorg-22.1.0-rc2
> Damn, copied wrong link:
>
>    https://www.kernel.org/pub/tools/llvm/files/llvm-22.1.0-rc2-x86_64.tar.xz

Thanks for the link. Will play around with that.

Cheers,
Longman

>> See: Documentation/kbuild/llvm.rst
>>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH/for-next v2 2/2] cgroup/cpuset: Introduce a new top level cpuset_top_mutex
  2026-02-02 18:29         ` Waiman Long
@ 2026-02-04  1:55           ` Chen Ridong
  2026-02-04 20:52             ` Waiman Long
  0 siblings, 1 reply; 23+ messages in thread
From: Chen Ridong @ 2026-02-04  1:55 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest



On 2026/2/3 2:29, Waiman Long wrote:
> On 2/1/26 8:11 PM, Chen Ridong wrote:
>>
>> On 2026/2/1 7:13, Waiman Long wrote:
>>> On 1/30/26 9:53 PM, Chen Ridong wrote:
>>>> On 2026/1/30 23:42, Waiman Long wrote:
>>>>> The current cpuset partition code is able to dynamically update
>>>>> the sched domains of a running system and the corresponding
>>>>> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
>>>>> "isolcpus=domain,..." boot command line feature at run time.
>>>>>
>>>>> The housekeeping cpumask update requires flushing a number of different
>>>>> workqueues which may not be safe with cpus_read_lock() held as the
>>>>> workqueue flushing code may acquire cpus_read_lock() or acquire locks
>>>>> which have a locking dependency on cpus_read_lock() down the chain. Below
>>>>> is an example of such a circular locking problem.
>>>>>
>>>>>     ======================================================
>>>>>     WARNING: possible circular locking dependency detected
>>>>>     6.18.0-test+ #2 Tainted: G S
>>>>>     ------------------------------------------------------
>>>>>     test_cpuset_prs/10971 is trying to acquire lock:
>>>>>     ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at:
>>>>> touch_wq_lockdep_map+0x7a/0x180
>>>>>
>>>>>     but task is already holding lock:
>>>>>     ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at:
>>>>> cpuset_partition_write+0x85/0x130
>>>>>
>>>>>     which lock already depends on the new lock.
>>>>>
>>>>>     the existing dependency chain (in reverse order) is:
>>>>>     -> #4 (cpuset_mutex){+.+.}-{4:4}:
>>>>>     -> #3 (cpu_hotplug_lock){++++}-{0:0}:
>>>>>     -> #2 (rtnl_mutex){+.+.}-{4:4}:
>>>>>     -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
>>>>>     -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
>>>>>
>>>>>     Chain exists of:
>>>>>       (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex
>>>>>
>>>>>     5 locks held by test_cpuset_prs/10971:
>>>>>      #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at:
>>>>> ksys_write+0xf9/0x1d0
>>>>>      #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at:
>>>>> kernfs_fop_write_iter+0x260/0x5f0
>>>>>      #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at:
>>>>> kernfs_fop_write_iter+0x2b6/0x5f0
>>>>>      #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at:
>>>>> cpuset_partition_write+0x77/0x130
>>>>>      #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at:
>>>>> cpuset_partition_write+0x85/0x130
>>>>>
>>>>>     Call Trace:
>>>>>      <TASK>
>>>>>        :
>>>>>      touch_wq_lockdep_map+0x93/0x180
>>>>>      __flush_workqueue+0x111/0x10b0
>>>>>      housekeeping_update+0x12d/0x2d0
>>>>>      update_parent_effective_cpumask+0x595/0x2440
>>>>>      update_prstate+0x89d/0xce0
>>>>>      cpuset_partition_write+0xc5/0x130
>>>>>      cgroup_file_write+0x1a5/0x680
>>>>>      kernfs_fop_write_iter+0x3df/0x5f0
>>>>>      vfs_write+0x525/0xfd0
>>>>>      ksys_write+0xf9/0x1d0
>>>>>      do_syscall_64+0x95/0x520
>>>>>      entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>>>
>>>>> To avoid such a circular locking dependency problem, we have to
>>>>> call housekeeping_update() without holding the cpus_read_lock() and
>>>>> cpuset_mutex. The current set of wq's flushed by housekeeping_update()
>>>>> may not have work functions that call cpus_read_lock() directly,
>>>>> but we are likely to extend the list of wq's that are flushed in the
>>>>> future. Moreover, the current set of work functions may hold locks that
>>>>> may have cpu_hotplug_lock down the dependency chain.
>>>>>
>>>>> One way to do that is to introduce a new top level cpuset_top_mutex
>>>>> which will be acquired first.  This new cpuset_top_mutex will provide
>>>>> the needed mutual exclusion without the need to hold cpus_read_lock().
>>>>>
>>>> Introducing a new global lock warrants careful consideration. I wonder if we
>>>> could make all updates to isolated_cpus asynchronous. If that is feasible, we
>>>> could avoid adding a global lock altogether. If not, we need to clarify which
>>>> updates must remain synchronous and which ones can be handled asynchronously.
>>> Almost all the cpuset code is run with cpuset_mutex held with either
>>> cpus_read_lock or cpus_write_lock. So there is no concurrent access/update to
>>> any of the cpuset internal data. The new cpuset_top_mutex is added to resolve the
>>> possible deadlock scenarios with the new housekeeping_update() call without
>>> breaking this model. Allowing parallel concurrent access/update to cpuset data will
>>> greatly complicate the code and we will likely miss some corner cases that we
>> I agree with that point. However, we already have paths where isolated_cpus is
>> updated asynchronously, meaning parallel concurrent access/update is already
>> happening. Therefore, we cannot entirely avoid such scenarios, so why not keep
>> the locking simple(make all updates to isolated_cpus asynchronous)?
> 
> isolated_cpus should only be updated in isolated_cpus_update() where both
> cpuset_mutex and callback_lock are held. It can be read asynchronously if either
> cpuset_mutex or callback_lock is held. Can you show me the  places where this
> rule isn't followed?
> 

I was considering that since the hotplug path calls update_isolation_cpumasks
asynchronously, could other cpuset paths (such as setting CPUs or partitions)
also call update_isolation_cpumasks asynchronously? If so, the global
cpuset_top_mutex lock might be unnecessary. Note that isolated_cpus is updated
synchronously, while housekeeping_update is invoked asynchronously.

Just a thought for discussion, and I’d really appreciate your insights on this.
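
To make the "state updated synchronously, propagation deferred" idea
concrete, here is a minimal userspace sketch of a work item guarded by a
pending flag, loosely modeled on how queue_work() declines to requeue an
item that is still pending. It is an illustration under assumptions, not
the workqueue implementation: struct work_model, queue_work_model(),
run_pending_model(), isolated_now and hk_seen are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a work item with a pending flag. */
struct work_model {
	atomic_bool pending;
	void (*fn)(void);
};

static unsigned long isolated_now; /* updated synchronously by the caller */
static unsigned long hk_seen;      /* catches up only when the work runs */

/* The deferred step: copy the current state, whatever it is by now. */
static void propagate(void)
{
	hk_seen = isolated_now;
}

static struct work_model hk_work = { .fn = propagate };

/* Returns true if newly queued, false if the item was already pending. */
static bool queue_work_model(struct work_model *w)
{
	return !atomic_exchange(&w->pending, true);
}

/*
 * Clear the pending flag before running the function, so that an update
 * arriving while the function runs can queue the item again.
 */
static bool run_pending_model(struct work_model *w)
{
	if (!atomic_exchange(&w->pending, false))
		return false;
	w->fn();
	return true;
}
```

Two updates that arrive before the work runs collapse into a single
execution that sees the latest state, which is also why the deferred
value can lag behind the synchronous one for a short time.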

-- 
Best regards,
Ridong


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH/for-next v2 2/2] cgroup/cpuset: Introduce a new top level cpuset_top_mutex
  2026-02-04  1:55           ` Chen Ridong
@ 2026-02-04 20:52             ` Waiman Long
  0 siblings, 0 replies; 23+ messages in thread
From: Waiman Long @ 2026-02-04 20:52 UTC (permalink / raw)
  To: Chen Ridong, Waiman Long, Tejun Heo, Johannes Weiner,
	Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker,
	Thomas Gleixner, Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest

On 2/3/26 8:55 PM, Chen Ridong wrote:
>
> On 2026/2/3 2:29, Waiman Long wrote:
>> On 2/1/26 8:11 PM, Chen Ridong wrote:
>>> On 2026/2/1 7:13, Waiman Long wrote:
>>>> On 1/30/26 9:53 PM, Chen Ridong wrote:
>>>>> On 2026/1/30 23:42, Waiman Long wrote:
>>>>>> The current cpuset partition code is able to dynamically update
>>>>>> the sched domains of a running system and the corresponding
>>>>>> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
>>>>>> "isolcpus=domain,..." boot command line feature at run time.
>>>>>>
>>>>>> The housekeeping cpumask update requires flushing a number of different
>>>>>> workqueues which may not be safe with cpus_read_lock() held as the
>>>>>> workqueue flushing code may acquire cpus_read_lock() or acquire locks
>>>>>> which have a locking dependency on cpus_read_lock() down the chain. Below
>>>>>> is an example of such a circular locking problem.
>>>>>>
>>>>>>      ======================================================
>>>>>>      WARNING: possible circular locking dependency detected
>>>>>>      6.18.0-test+ #2 Tainted: G S
>>>>>>      ------------------------------------------------------
>>>>>>      test_cpuset_prs/10971 is trying to acquire lock:
>>>>>>      ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at:
>>>>>> touch_wq_lockdep_map+0x7a/0x180
>>>>>>
>>>>>>      but task is already holding lock:
>>>>>>      ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at:
>>>>>> cpuset_partition_write+0x85/0x130
>>>>>>
>>>>>>      which lock already depends on the new lock.
>>>>>>
>>>>>>      the existing dependency chain (in reverse order) is:
>>>>>>      -> #4 (cpuset_mutex){+.+.}-{4:4}:
>>>>>>      -> #3 (cpu_hotplug_lock){++++}-{0:0}:
>>>>>>      -> #2 (rtnl_mutex){+.+.}-{4:4}:
>>>>>>      -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
>>>>>>      -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
>>>>>>
>>>>>>      Chain exists of:
>>>>>>        (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex
>>>>>>
>>>>>>      5 locks held by test_cpuset_prs/10971:
>>>>>>       #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at:
>>>>>> ksys_write+0xf9/0x1d0
>>>>>>       #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at:
>>>>>> kernfs_fop_write_iter+0x260/0x5f0
>>>>>>       #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at:
>>>>>> kernfs_fop_write_iter+0x2b6/0x5f0
>>>>>>       #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at:
>>>>>> cpuset_partition_write+0x77/0x130
>>>>>>       #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at:
>>>>>> cpuset_partition_write+0x85/0x130
>>>>>>
>>>>>>      Call Trace:
>>>>>>       <TASK>
>>>>>>         :
>>>>>>       touch_wq_lockdep_map+0x93/0x180
>>>>>>       __flush_workqueue+0x111/0x10b0
>>>>>>       housekeeping_update+0x12d/0x2d0
>>>>>>       update_parent_effective_cpumask+0x595/0x2440
>>>>>>       update_prstate+0x89d/0xce0
>>>>>>       cpuset_partition_write+0xc5/0x130
>>>>>>       cgroup_file_write+0x1a5/0x680
>>>>>>       kernfs_fop_write_iter+0x3df/0x5f0
>>>>>>       vfs_write+0x525/0xfd0
>>>>>>       ksys_write+0xf9/0x1d0
>>>>>>       do_syscall_64+0x95/0x520
>>>>>>       entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>>>>
>>>>>> To avoid such a circular locking dependency problem, we have to
>>>>>> call housekeeping_update() without holding the cpus_read_lock() and
>>>>>> cpuset_mutex. The current set of wq's flushed by housekeeping_update()
>>>>>> may not have work functions that call cpus_read_lock() directly,
>>>>>> but we are likely to extend the list of wq's that are flushed in the
>>>>>> future. Moreover, the current set of work functions may hold locks that
>>>>>> may have cpu_hotplug_lock down the dependency chain.
>>>>>>
>>>>>> One way to do that is to introduce a new top level cpuset_top_mutex
>>>>>> which will be acquired first.  This new cpuset_top_mutex will provide
>>>>>> the needed mutual exclusion without the need to hold cpus_read_lock().
>>>>>>
>>>>> Introducing a new global lock warrants careful consideration. I wonder if we
>>>>> could make all updates to isolated_cpus asynchronous. If that is feasible, we
>>>>> could avoid adding a global lock altogether. If not, we need to clarify which
>>>>> updates must remain synchronous and which ones can be handled asynchronously.
>>>> Almost all the cpuset code is run with cpuset_mutex held with either
>>>> cpus_read_lock or cpus_write_lock. So there is no concurrent access/update to
>>>> any of the cpuset internal data. The new cpuset_top_mutex is added to resolve the
>>>> possible deadlock scenarios with the new housekeeping_update() call without
>>>> breaking this model. Allowing parallel concurrent access/update to cpuset data will
>>>> greatly complicate the code and we will likely miss some corner cases that we
>>> I agree with that point. However, we already have paths where isolated_cpus is
>>> updated asynchronously, meaning parallel concurrent access/update is already
>>> happening. Therefore, we cannot entirely avoid such scenarios, so why not keep
>>> the locking simple(make all updates to isolated_cpus asynchronous)?
>> isolated_cpus should only be updated in isolated_cpus_update() where both
>> cpuset_mutex and callback_lock are held. It can be read asynchronously if either
>> cpuset_mutex or callback_lock is held. Can you show me the  places where this
>> rule isn't followed?
>>
> I was considering that since the hotplug path calls update_isolation_cpumasks
> asynchronously, could other cpuset paths (such as setting CPUs or partitions)
> also call update_isolation_cpumasks asynchronously? If so, the global
> cpuset_top_mutex lock might be unnecessary. Note that isolated_cpus is updated
> synchronously, while housekeeping_update is invoked asynchronously.

update_isolation_cpumasks() is always called synchronously as
cpuset_mutex will always be held. With the current patchset, the only
asynchronous piece is CPU hotplug vs the housekeeping_update() call
as it is being called without holding cpus_read_lock(). AFAICS, it
should not be a problem. Please let me know if you are aware of some
potential hazard with the current setup.
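
One possible shape of the top-level mutex pattern discussed in this
thread is sketched below in userspace C. It is a simplified model under
assumptions and may differ from the actual patch: top_mutex,
hotplug_model and inner_mutex stand in for cpuset_top_mutex,
cpus_read_lock() (modeled as a plain mutex) and cpuset_mutex, and all
function and variable names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t top_mutex = PTHREAD_MUTEX_INITIALIZER;     /* models cpuset_top_mutex */
static pthread_mutex_t hotplug_model = PTHREAD_MUTEX_INITIALIZER; /* models cpus_read_lock() */
static pthread_mutex_t inner_mutex = PTHREAD_MUTEX_INITIALIZER;   /* models cpuset_mutex */

static bool inner_locks_held;         /* bookkeeping for the check below */
static bool flushed_with_inner_locks; /* set if the slow step ran too early */
static unsigned long isolated_state;  /* models isolated_cpus */
static unsigned long hk_state;        /* models the housekeeping cpumask */

/* Stands in for housekeeping_update(), which may flush workqueues. */
static void slow_update_model(unsigned long mask)
{
	if (inner_locks_held)
		flushed_with_inner_locks = true;
	hk_state = mask;
}

static void partition_write_model(unsigned long mask)
{
	unsigned long snapshot;

	pthread_mutex_lock(&top_mutex);     /* serializes whole operations */
	pthread_mutex_lock(&hotplug_model);
	pthread_mutex_lock(&inner_mutex);
	inner_locks_held = true;

	isolated_state = mask;              /* synchronous state update */
	snapshot = isolated_state;

	inner_locks_held = false;
	pthread_mutex_unlock(&inner_mutex);
	pthread_mutex_unlock(&hotplug_model);

	/* Only top_mutex is held here, so the flush cannot close the cycle. */
	slow_update_model(snapshot);
	pthread_mutex_unlock(&top_mutex);
}
```

In this model the outermost mutex still serializes complete operations
against each other, while the step that may flush workqueues runs after
the hotplug and inner locks have been dropped.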

Cheers,
Longman


^ permalink raw reply	[flat|nested] 23+ messages in thread
