[PATCH/for-next v4 0/4] cgroup/cpuset: Fix partition related locking issues

public inbox for cgroups@vger.kernel.org

* [PATCH/for-next v4 0/4] cgroup/cpuset: Fix partition related locking issues
@ 2026-02-06 20:37 Waiman Long
  2026-02-06 20:37 ` [PATCH/for-next v4 1/4] cgroup/cpuset: Clarify exclusion rules for cpuset internal variables Waiman Long
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Waiman Long @ 2026-02-06 20:37 UTC
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long

 v4:
  - Fix various issues as reported by Chen Ridong.

 v3:
  - Add a new patch to clarify the locking rules for internal variables
  - Defer all housekeeping_update() calls with associated
    rebuild_sched_domains*() calls to either workqueue or task_work.

 v2:
  - Change patch 1 to use workqueue instead of task_work as it is a
    per-cpu kthread that performs the cpuset shutdown and bringup work.
  - Simplify and streamline some of the code.

After booting the latest cgroup for-next debug kernel with the latest
cgroup changes as well as Frederic's "cpuset/isolation: Honour kthreads
preferred affinity" patch series [1] merged on top and running the
test_cpuset_prs.sh test, a circular locking dependency lockdep splat
was reported. See patch 2 for details.

To fix this issue, a new top level cpuset_top_mutex is added and the
call to housekeeping_update() is deferred to either a task_work or to
a workqueue.
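
For illustration only, the workqueue half of this deferral can be modeled
in user space as follows. This is a minimal sketch, not the patch code:
the names request_update(), run_pending_work() and deferred_update() are
hypothetical, the counter stands in for housekeeping_update(), and
neither locking, task_work nor cpuset_top_mutex is modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; this is not the cpuset code. */
static bool isolated_cpus_updating;	/* an isolation update is owed */
static bool work_pending;		/* models a queued, not yet run work item */
static int housekeeping_updates;	/* counts simulated housekeeping_update() calls */

/* Models the work function: runs later, outside the hotplug section. */
static void deferred_update(void)
{
	if (isolated_cpus_updating) {
		isolated_cpus_updating = false;
		housekeeping_updates++;
	}
}

/* Models the caller: defer when running in the hotplug kthread. */
static void request_update(bool in_hotplug_kthread)
{
	if (!isolated_cpus_updating)
		return;
	if (in_hotplug_kthread) {
		/* Queueing an already pending item changes nothing. */
		work_pending = true;
		return;
	}
	/* Control file write path: update right away. */
	housekeeping_updates++;
	isolated_cpus_updating = false;
}

/* Models the workqueue running the item after the lock is dropped. */
static void run_pending_work(void)
{
	if (work_pending) {
		work_pending = false;
		deferred_update();
	}
}
```

In this model, two deferred requests before the work runs collapse into
a single update, and the update only happens once run_pending_work() is
called.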

With these changes in place, the cpuset test ran to completion with no
failure and no lockdep splat.

[1] https://lore.kernel.org/lkml/20260125224541.50226-1-frederic@kernel.org/

Waiman Long (4):
  cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
  cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to
    workqueue
  cgroup/cpuset: Call housekeeping_update() without holding
    cpus_read_lock
  cgroup/cpuset: Eliminate some duplicated rebuild_sched_domains() calls

 kernel/cgroup/cpuset.c                        | 241 +++++++++++++-----
 kernel/sched/isolation.c                      |   4 +-
 kernel/time/timer_migration.c                 |   4 +-
 .../selftests/cgroup/test_cpuset_prs.sh       |  13 +-
 4 files changed, 196 insertions(+), 66 deletions(-)

-- 
2.52.0



* [PATCH/for-next v4 1/4] cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
  2026-02-06 20:37 [PATCH/for-next v4 0/4] cgroup/cpuset: Fix partition related locking issues Waiman Long
@ 2026-02-06 20:37 ` Waiman Long
  2026-02-09  3:41   ` Chen Ridong
  2026-02-06 20:37 ` [PATCH/for-next v4 2/4] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue Waiman Long
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2026-02-06 20:37 UTC
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long

Clarify the locking rules associated with file level internal variables
inside the cpuset code. There is no functional change.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 105 ++++++++++++++++++++++++-----------------
 1 file changed, 61 insertions(+), 44 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c43efef7df71..a4c6386a594d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -61,6 +61,58 @@ static const char * const perr_strings[] = {
 	[PERR_REMOTE]    = "Have remote partition underneath",
 };
 
+/*
+ * CPUSET Locking Convention
+ * -------------------------
+ *
+ * Below are the three global locks guarding cpuset structures in lock
+ * acquisition order:
+ *  - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
+ *  - cpuset_mutex
+ *  - callback_lock (raw spinlock)
+ *
+ * A task must hold all the three locks to modify externally visible or
+ * used fields of cpusets, though some of the internally used cpuset fields
+ * and internal variables can be modified without holding callback_lock. If only
+ * reliable read access of the externally used fields are needed, a task can
+ * hold either cpuset_mutex or callback_lock which are exposed to other
+ * external subsystems.
+ *
+ * If a task holds cpu_hotplug_lock and cpuset_mutex, it blocks others,
+ * ensuring that it is the only task able to also acquire callback_lock and
+ * be able to modify cpusets.  It can perform various checks on the cpuset
+ * structure first, knowing nothing will change. It can also allocate memory
+ * without holding callback_lock. While it is performing these checks, various
+ * callback routines can briefly acquire callback_lock to query cpusets.  Once
+ * it is ready to make the changes, it takes callback_lock, blocking everyone
+ * else.
+ *
+ * Calls to the kernel memory allocator cannot be made while holding
+ * callback_lock which is a spinlock, as the memory allocator may sleep or
+ * call back into cpuset code and acquire callback_lock.
+ *
+ * Now, the task_struct fields mems_allowed and mempolicy may be changed
+ * by other task, we use alloc_lock in the task_struct fields to protect
+ * them.
+ *
+ * The cpuset_common_seq_show() handlers only hold callback_lock across
+ * small pieces of code, such as when reading out possibly multi-word
+ * cpumasks and nodemasks.
+ */
+
+static DEFINE_MUTEX(cpuset_mutex);
+
+/*
+ * File level internal variables below follow one of the following exclusion
+ * rules.
+ *
+ * RWCS: Read/write-able by holding either cpus_write_lock or both
+ *       cpus_read_lock and cpuset_mutex.
+ *
+ * CSCB: Readable by holding either cpuset_mutex or callback_lock. Writable
+ *	 by holding both cpuset_mutex and callback_lock.
+ */
+
 /*
  * For local partitions, update to subpartitions_cpus & isolated_cpus is done
  * in update_parent_effective_cpumask(). For remote partitions, it is done in
@@ -70,19 +122,18 @@ static const char * const perr_strings[] = {
  * Exclusive CPUs distributed out to local or remote sub-partitions of
  * top_cpuset
  */
-static cpumask_var_t	subpartitions_cpus;
+static cpumask_var_t	subpartitions_cpus;	/* RWCS */
 
 /*
- * Exclusive CPUs in isolated partitions
+ * Exclusive CPUs in isolated partitions (shown in cpuset.cpus.isolated)
  */
-static cpumask_var_t	isolated_cpus;
+static cpumask_var_t	isolated_cpus;		/* CSCB */
 
 /*
- * isolated_cpus updating flag (protected by cpuset_mutex)
- * Set if isolated_cpus is going to be updated in the current
- * cpuset_mutex crtical section.
+ * Set if isolated_cpus is being updated in the current cpuset_mutex
+ * critical section.
  */
-static bool isolated_cpus_updating;
+static bool		isolated_cpus_updating;	/* RWCS */
 
 /*
  * A flag to force sched domain rebuild at the end of an operation.
@@ -98,7 +149,7 @@ static bool isolated_cpus_updating;
  * Note that update_relax_domain_level() in cpuset-v1.c can still call
  * rebuild_sched_domains_locked() directly without using this flag.
  */
-static bool force_sd_rebuild;
+static bool force_sd_rebuild;			/* RWCS */
 
 /*
  * Partition root states:
@@ -218,42 +269,6 @@ struct cpuset top_cpuset = {
 	.partition_root_state = PRS_ROOT,
 };
 
-/*
- * There are two global locks guarding cpuset structures - cpuset_mutex and
- * callback_lock. The cpuset code uses only cpuset_mutex. Other kernel
- * subsystems can use cpuset_lock()/cpuset_unlock() to prevent change to cpuset
- * structures. Note that cpuset_mutex needs to be a mutex as it is used in
- * paths that rely on priority inheritance (e.g. scheduler - on RT) for
- * correctness.
- *
- * A task must hold both locks to modify cpusets.  If a task holds
- * cpuset_mutex, it blocks others, ensuring that it is the only task able to
- * also acquire callback_lock and be able to modify cpusets.  It can perform
- * various checks on the cpuset structure first, knowing nothing will change.
- * It can also allocate memory while just holding cpuset_mutex.  While it is
- * performing these checks, various callback routines can briefly acquire
- * callback_lock to query cpusets.  Once it is ready to make the changes, it
- * takes callback_lock, blocking everyone else.
- *
- * Calls to the kernel memory allocator can not be made while holding
- * callback_lock, as that would risk double tripping on callback_lock
- * from one of the callbacks into the cpuset code from within
- * __alloc_pages().
- *
- * If a task is only holding callback_lock, then it has read-only
- * access to cpusets.
- *
- * Now, the task_struct fields mems_allowed and mempolicy may be changed
- * by other task, we use alloc_lock in the task_struct fields to protect
- * them.
- *
- * The cpuset_common_seq_show() handlers only hold callback_lock across
- * small pieces of code, such as when reading out possibly multi-word
- * cpumasks and nodemasks.
- */
-
-static DEFINE_MUTEX(cpuset_mutex);
-
 /**
  * cpuset_lock - Acquire the global cpuset mutex
  *
@@ -1163,6 +1178,8 @@ static void reset_partition_data(struct cpuset *cs)
 static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus)
 {
 	WARN_ON_ONCE(old_prs == new_prs);
+	lockdep_assert_held(&callback_lock);
+	lockdep_assert_held(&cpuset_mutex);
 	if (new_prs == PRS_ISOLATED)
 		cpumask_or(isolated_cpus, isolated_cpus, xcpus);
 	else
-- 
2.52.0



* Re: [PATCH/for-next v4 1/4] cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
  2026-02-06 20:37 ` [PATCH/for-next v4 1/4] cgroup/cpuset: Clarify exclusion rules for cpuset internal variables Waiman Long
@ 2026-02-09  3:41   ` Chen Ridong
  2026-02-09 19:58     ` Waiman Long
  0 siblings, 1 reply; 22+ messages in thread
From: Chen Ridong @ 2026-02-09  3:41 UTC
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest



On 2026/2/7 4:37, Waiman Long wrote:
> Clarify the locking rules associated with file level internal variables
> inside the cpuset code. There is no functional change.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/cgroup/cpuset.c | 105 ++++++++++++++++++++++++-----------------
>  1 file changed, 61 insertions(+), 44 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index c43efef7df71..a4c6386a594d 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -61,6 +61,58 @@ static const char * const perr_strings[] = {
>  	[PERR_REMOTE]    = "Have remote partition underneath",
>  };
>  
> +/*
> + * CPUSET Locking Convention
> + * -------------------------
> + *
> + * Below are the three global locks guarding cpuset structures in lock
> + * acquisition order:
> + *  - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
> + *  - cpuset_mutex
> + *  - callback_lock (raw spinlock)
> + *
> + * A task must hold all the three locks to modify externally visible or
> + * used fields of cpusets, though some of the internally used cpuset fields
> + * and internal variables can be modified without holding callback_lock. If only
> + * reliable read access of the externally used fields are needed, a task can
> + * hold either cpuset_mutex or callback_lock which are exposed to other
> + * external subsystems.
> + *
> + * If a task holds cpu_hotplug_lock and cpuset_mutex, it blocks others,
> + * ensuring that it is the only task able to also acquire callback_lock and
> + * be able to modify cpusets.  It can perform various checks on the cpuset
> + * structure first, knowing nothing will change. It can also allocate memory
> + * without holding callback_lock. While it is performing these checks, various
> + * callback routines can briefly acquire callback_lock to query cpusets.  Once
> + * it is ready to make the changes, it takes callback_lock, blocking everyone
> + * else.
> + *
> + * Calls to the kernel memory allocator cannot be made while holding
> + * callback_lock which is a spinlock, as the memory allocator may sleep or
> + * call back into cpuset code and acquire callback_lock.
> + *
> + * Now, the task_struct fields mems_allowed and mempolicy may be changed
> + * by other task, we use alloc_lock in the task_struct fields to protect
> + * them.
> + *
> + * The cpuset_common_seq_show() handlers only hold callback_lock across
> + * small pieces of code, such as when reading out possibly multi-word
> + * cpumasks and nodemasks.
> + */
> +
> +static DEFINE_MUTEX(cpuset_mutex);
> +
> +/*
> + * File level internal variables below follow one of the following exclusion
> + * rules.
> + *
> + * RWCS: Read/write-able by holding either cpus_write_lock or both
> + *       cpus_read_lock and cpuset_mutex.
> + *

Does this mean that variables can be read or written only by holding
cpus_write_lock?

I believe that to write cpuset variables, we must hold either (cpus_write_lock
and cpuset_mutex) or (cpus_read_lock and cpuset_mutex).

> + * CSCB: Readable by holding either cpuset_mutex or callback_lock. Writable
> + *	 by holding both cpuset_mutex and callback_lock.
> + */
> +
>  /*
>   * For local partitions, update to subpartitions_cpus & isolated_cpus is done
>   * in update_parent_effective_cpumask(). For remote partitions, it is done in
> @@ -70,19 +122,18 @@ static const char * const perr_strings[] = {
>   * Exclusive CPUs distributed out to local or remote sub-partitions of
>   * top_cpuset
>   */
> -static cpumask_var_t	subpartitions_cpus;
> +static cpumask_var_t	subpartitions_cpus;	/* RWCS */
>  
>  /*
> - * Exclusive CPUs in isolated partitions
> + * Exclusive CPUs in isolated partitions (shown in cpuset.cpus.isolated)
>   */
> -static cpumask_var_t	isolated_cpus;
> +static cpumask_var_t	isolated_cpus;		/* CSCB */
>  
>  /*
> - * isolated_cpus updating flag (protected by cpuset_mutex)
> - * Set if isolated_cpus is going to be updated in the current
> - * cpuset_mutex crtical section.
> + * Set if isolated_cpus is being updated in the current cpuset_mutex
> + * critical section.
>   */
> -static bool isolated_cpus_updating;
> +static bool		isolated_cpus_updating;	/* RWCS */
>  
>  /*
>   * A flag to force sched domain rebuild at the end of an operation.
> @@ -98,7 +149,7 @@ static bool isolated_cpus_updating;
>   * Note that update_relax_domain_level() in cpuset-v1.c can still call
>   * rebuild_sched_domains_locked() directly without using this flag.
>   */
> -static bool force_sd_rebuild;
> +static bool force_sd_rebuild;			/* RWCS */
>  
>  /*
>   * Partition root states:
> @@ -218,42 +269,6 @@ struct cpuset top_cpuset = {
>  	.partition_root_state = PRS_ROOT,
>  };
>  
> -/*
> - * There are two global locks guarding cpuset structures - cpuset_mutex and
> - * callback_lock. The cpuset code uses only cpuset_mutex. Other kernel
> - * subsystems can use cpuset_lock()/cpuset_unlock() to prevent change to cpuset
> - * structures. Note that cpuset_mutex needs to be a mutex as it is used in
> - * paths that rely on priority inheritance (e.g. scheduler - on RT) for
> - * correctness.
> - *
> - * A task must hold both locks to modify cpusets.  If a task holds
> - * cpuset_mutex, it blocks others, ensuring that it is the only task able to
> - * also acquire callback_lock and be able to modify cpusets.  It can perform
> - * various checks on the cpuset structure first, knowing nothing will change.
> - * It can also allocate memory while just holding cpuset_mutex.  While it is
> - * performing these checks, various callback routines can briefly acquire
> - * callback_lock to query cpusets.  Once it is ready to make the changes, it
> - * takes callback_lock, blocking everyone else.
> - *
> - * Calls to the kernel memory allocator can not be made while holding
> - * callback_lock, as that would risk double tripping on callback_lock
> - * from one of the callbacks into the cpuset code from within
> - * __alloc_pages().
> - *
> - * If a task is only holding callback_lock, then it has read-only
> - * access to cpusets.
> - *
> - * Now, the task_struct fields mems_allowed and mempolicy may be changed
> - * by other task, we use alloc_lock in the task_struct fields to protect
> - * them.
> - *
> - * The cpuset_common_seq_show() handlers only hold callback_lock across
> - * small pieces of code, such as when reading out possibly multi-word
> - * cpumasks and nodemasks.
> - */
> -
> -static DEFINE_MUTEX(cpuset_mutex);
> -
>  /**
>   * cpuset_lock - Acquire the global cpuset mutex
>   *
> @@ -1163,6 +1178,8 @@ static void reset_partition_data(struct cpuset *cs)
>  static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus)
>  {
>  	WARN_ON_ONCE(old_prs == new_prs);
> +	lockdep_assert_held(&callback_lock);
> +	lockdep_assert_held(&cpuset_mutex);
>  	if (new_prs == PRS_ISOLATED)
>  		cpumask_or(isolated_cpus, isolated_cpus, xcpus);
>  	else

-- 
Best regards,
Ridong



* Re: [PATCH/for-next v4 1/4] cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
  2026-02-09  3:41   ` Chen Ridong
@ 2026-02-09 19:58     ` Waiman Long
  0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2026-02-09 19:58 UTC
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest

On 2/8/26 10:41 PM, Chen Ridong wrote:
>
> On 2026/2/7 4:37, Waiman Long wrote:
>> Clarify the locking rules associated with file level internal variables
>> inside the cpuset code. There is no functional change.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   kernel/cgroup/cpuset.c | 105 ++++++++++++++++++++++++-----------------
>>   1 file changed, 61 insertions(+), 44 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index c43efef7df71..a4c6386a594d 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -61,6 +61,58 @@ static const char * const perr_strings[] = {
>>   	[PERR_REMOTE]    = "Have remote partition underneath",
>>   };
>>   
>> +/*
>> + * CPUSET Locking Convention
>> + * -------------------------
>> + *
>> + * Below are the three global locks guarding cpuset structures in lock
>> + * acquisition order:
>> + *  - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
>> + *  - cpuset_mutex
>> + *  - callback_lock (raw spinlock)
>> + *
>> + * A task must hold all the three locks to modify externally visible or
>> + * used fields of cpusets, though some of the internally used cpuset fields
>> + * and internal variables can be modified without holding callback_lock. If only
>> + * reliable read access of the externally used fields are needed, a task can
>> + * hold either cpuset_mutex or callback_lock which are exposed to other
>> + * external subsystems.
>> + *
>> + * If a task holds cpu_hotplug_lock and cpuset_mutex, it blocks others,
>> + * ensuring that it is the only task able to also acquire callback_lock and
>> + * be able to modify cpusets.  It can perform various checks on the cpuset
>> + * structure first, knowing nothing will change. It can also allocate memory
>> + * without holding callback_lock. While it is performing these checks, various
>> + * callback routines can briefly acquire callback_lock to query cpusets.  Once
>> + * it is ready to make the changes, it takes callback_lock, blocking everyone
>> + * else.
>> + *
>> + * Calls to the kernel memory allocator cannot be made while holding
>> + * callback_lock which is a spinlock, as the memory allocator may sleep or
>> + * call back into cpuset code and acquire callback_lock.
>> + *
>> + * Now, the task_struct fields mems_allowed and mempolicy may be changed
>> + * by other task, we use alloc_lock in the task_struct fields to protect
>> + * them.
>> + *
>> + * The cpuset_common_seq_show() handlers only hold callback_lock across
>> + * small pieces of code, such as when reading out possibly multi-word
>> + * cpumasks and nodemasks.
>> + */
>> +
>> +static DEFINE_MUTEX(cpuset_mutex);
>> +
>> +/*
>> + * File level internal variables below follow one of the following exclusion
>> + * rules.
>> + *
>> + * RWCS: Read/write-able by holding either cpus_write_lock or both
>> + *       cpus_read_lock and cpuset_mutex.
>> + *
> Does this mean that variables can be read or written only by holding
> cpus_write_lock?
>
> I believe that to write cpuset variables, we must hold either (cpus_write_lock
> and cpuset_mutex) or (cpus_read_lock and cpuset_mutex).

The point of the locking rule is to spell out the condition for
mutual exclusion. Once cpus_write_lock is held, no other task can hold
cpus_read_lock and cpuset_mutex. I will consider holding cpuset_mutex as
optional, though almost all the cpuset internal variables are accessed
from the CPU hotplug side with both cpus_write_lock and cpuset_mutex
held. The only exception is force_sd_rebuild (sd_rebuild) that can be
set directly from the scheduling code without holding cpuset_mutex. I
can change it to "holding cpus_write_lock (and optionally cpuset_mutex)
or both cpus_read_lock and cpuset_mutex" if that makes it clearer.
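
As an illustration only (not part of the patch), the RWCS and CSCB rules
can be written as predicates over the set of locks a task holds. The
struct held_locks type and the helper names below are hypothetical and
exist only for this sketch; the lock names come from the comment block
in patch 1.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the locks held by the current task. */
struct held_locks {
	bool cpus_write_lock;	/* cpu_hotplug_lock held for write */
	bool cpus_read_lock;	/* cpu_hotplug_lock held for read */
	bool cpuset_mutex;
	bool callback_lock;
};

/*
 * RWCS: cpus_write_lock alone is enough (cpuset_mutex is optional),
 * otherwise both cpus_read_lock and cpuset_mutex are needed.
 */
static bool rwcs_access_ok(const struct held_locks *h)
{
	return h->cpus_write_lock || (h->cpus_read_lock && h->cpuset_mutex);
}

/* CSCB: reading needs either cpuset_mutex or callback_lock. */
static bool cscb_read_ok(const struct held_locks *h)
{
	return h->cpuset_mutex || h->callback_lock;
}

/* CSCB: writing needs both cpuset_mutex and callback_lock. */
static bool cscb_write_ok(const struct held_locks *h)
{
	return h->cpuset_mutex && h->callback_lock;
}
```

Under this model, a task holding only cpus_read_lock fails the RWCS
check, which matches the mutual exclusion argument above: a writer of
cpu_hotplug_lock excludes every reader, and readers exclude each other
through cpuset_mutex.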

Cheers,
Longman



* [PATCH/for-next v4 2/4] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
  2026-02-06 20:37 [PATCH/for-next v4 0/4] cgroup/cpuset: Fix partition related locking issues Waiman Long
  2026-02-06 20:37 ` [PATCH/for-next v4 1/4] cgroup/cpuset: Clarify exclusion rules for cpuset internal variables Waiman Long
@ 2026-02-06 20:37 ` Waiman Long
  2026-02-06 22:28   ` Frederic Weisbecker
  2026-02-09  6:57   ` Chen Ridong
  2026-02-06 20:37 ` [PATCH/for-next v4 3/4] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock Waiman Long
  2026-02-06 20:37 ` [PATCH/for-next v4 4/4] cgroup/cpuset: Eliminate some duplicated rebuild_sched_domains() calls Waiman Long
  3 siblings, 2 replies; 22+ messages in thread
From: Waiman Long @ 2026-02-06 20:37 UTC
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long

The update_isolation_cpumasks() function can be called either directly
from regular cpuset control file write with cpuset_full_lock() called
or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held.

As we are going to enable dynamic update to the nohz_full housekeeping
cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
allowing the CPU hotplug path to call into housekeeping_update() directly
from update_isolation_cpumasks() will likely cause deadlock. So we
have to defer any call to housekeeping_update() until after the CPU
hotplug operation has finished. This is now done via the workqueue where
the actual housekeeping_update() call, if needed, will happen after
cpus_write_lock is released.

We can't use the synchronous task_work API as calls from the CPU hotplug
path happen in the per-cpu kthread of the CPU that is being shut down
or brought up. Because of the asynchronous nature of workqueue, the
HK_TYPE_DOMAIN housekeeping cpumask will be updated a bit later than the
"cpuset.cpus.isolated" control file in this case.

Also add a check in test_cpuset_prs.sh and modify some existing
test cases to confirm that "cpuset.cpus.isolated" and HK_TYPE_DOMAIN
housekeeping cpumask will both be updated.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c                        | 41 +++++++++++++++++--
 .../selftests/cgroup/test_cpuset_prs.sh       | 13 ++++--
 2 files changed, 48 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a4c6386a594d..eb0eabd85e8c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1302,6 +1302,17 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
 	return false;
 }
 
+static void isolcpus_workfn(struct work_struct *work)
+{
+	cpuset_full_lock();
+	if (isolated_cpus_updating) {
+		isolated_cpus_updating = false;
+		WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
+		rebuild_sched_domains_locked();
+	}
+	cpuset_full_unlock();
+}
+
 /*
  * update_isolation_cpumasks - Update external isolation related CPU masks
  *
@@ -1310,14 +1321,38 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
  */
 static void update_isolation_cpumasks(void)
 {
-	int ret;
+	static DECLARE_WORK(isolcpus_work, isolcpus_workfn);
 
+	lockdep_assert_cpuset_lock_held();
 	if (!isolated_cpus_updating)
 		return;
 
-	ret = housekeeping_update(isolated_cpus);
-	WARN_ON_ONCE(ret < 0);
+	/*
+	 * This function can be reached either directly from regular cpuset
+	 * control file write or via CPU hotplug. In the latter case, it is
+	 * the per-cpu kthread that calls cpuset_handle_hotplug() on behalf
+	 * of the task that initiates CPU shutdown or bringup.
+	 *
+	 * To have better flexibility and prevent the possibility of deadlock
+	 * when calling from CPU hotplug, we defer the housekeeping_update()
+	 * call to after the current cpuset critical section has finished.
+	 * This is done via workqueue.
+	 */
+	if (current->flags & PF_KTHREAD) {
+		/*
+		 * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work
+		 * item that is still pending. Before the pending bit is
+		 * cleared, the work data is copied out and work item dequeued.
+		 * So it is possible to queue the work again before the
+		 * isolcpus_workfn() is invoked to process the previously
+		 * queued work. Since isolcpus_workfn() doesn't use the work
+		 * item at all, this is not a problem.
+		 */
+		queue_work(system_unbound_wq, &isolcpus_work);
+		return;
+	}
 
+	WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
 	isolated_cpus_updating = false;
 }
 
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index 5dff3ad53867..0502b156582b 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -245,8 +245,9 @@ TEST_MATRIX=(
 	"C2-3:P1:S+  C3:P2  .      .     O2=0   O2=1    .      .     0 A1:2|A2:3 A1:P1|A2:P2"
 	"C2-3:P1:S+  C3:P1  .      .     O2=0    .      .      .     0 A1:|A2:3 A1:P1|A2:P1"
 	"C2-3:P1:S+  C3:P1  .      .     O3=0    .      .      .     0 A1:2|A2: A1:P1|A2:P1"
-	"C2-3:P1:S+  C3:P1  .      .    T:O2=0   .      .      .     0 A1:3|A2:3 A1:P1|A2:P-1"
-	"C2-3:P1:S+  C3:P1  .      .      .    T:O3=0   .      .     0 A1:2|A2:2 A1:P1|A2:P-1"
+	"C2-3:P1:S+  C3:P2  .      .    T:O2=0   .      .      .     0 A1:3|A2:3 A1:P1|A2:P-2"
+	"C1-3:P1:S+  C3:P2  .      .      .    T:O3=0   .      .     0 A1:1-2|A2:1-2 A1:P1|A2:P-2 3|"
+	"C1-3:P1:S+  C3:P2  .      .      .    T:O3=0  O3=1    .     0 A1:1-2|A2:3 A1:P1|A2:P2  3"
 	"$SETUP_A123_PARTITIONS    .     O1=0    .      .      .     0 A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
 	"$SETUP_A123_PARTITIONS    .     O2=0    .      .      .     0 A1:1|A2:|A3:3 A1:P1|A2:P1|A3:P1"
 	"$SETUP_A123_PARTITIONS    .     O3=0    .      .      .     0 A1:1|A2:2|A3: A1:P1|A2:P1|A3:P1"
@@ -764,7 +765,7 @@ check_cgroup_states()
 # only CPUs in isolated partitions as well as those that are isolated at
 # boot time.
 #
-# $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
+# $1 - expected isolated cpu list(s) <isolcpus1>{|<isolcpus2>}
 # <isolcpus1> - expected sched/domains value
 # <isolcpus2> - cpuset.cpus.isolated value = <isolcpus1> if not defined
 #
@@ -773,6 +774,7 @@ check_isolcpus()
 	EXPECTED_ISOLCPUS=$1
 	ISCPUS=${CGROUP2}/cpuset.cpus.isolated
 	ISOLCPUS=$(cat $ISCPUS)
+	HKICPUS=$(cat /sys/devices/system/cpu/isolated)
 	LASTISOLCPU=
 	SCHED_DOMAINS=/sys/kernel/debug/sched/domains
 	if [[ $EXPECTED_ISOLCPUS = . ]]
@@ -810,6 +812,11 @@ check_isolcpus()
 	ISOLCPUS=
 	EXPECTED_ISOLCPUS=$EXPECTED_SDOMAIN
 
+	#
+	# The inverse of HK_TYPE_DOMAIN cpumask in $HKICPUS should match $ISOLCPUS
+	#
+	[[ "$ISOLCPUS" != "$HKICPUS" ]] && return 1
+
 	#
 	# Use the sched domain in debugfs to check isolated CPUs, if available
 	#
-- 
2.52.0



* Re: [PATCH/for-next v4 2/4] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
  2026-02-06 20:37 ` [PATCH/for-next v4 2/4] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue Waiman Long
@ 2026-02-06 22:28   ` Frederic Weisbecker
  2026-02-08  2:00     ` Waiman Long
  2026-02-09  6:57   ` Chen Ridong
  1 sibling, 1 reply; 22+ messages in thread
From: Frederic Weisbecker @ 2026-02-06 22:28 UTC
  To: Waiman Long
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Thomas Gleixner, Shuah Khan, cgroups,
	linux-kernel, linux-kselftest

On Fri, Feb 06, 2026 at 03:37:10PM -0500, Waiman Long wrote:
> The update_isolation_cpumasks() function can be called either directly
> from regular cpuset control file write with cpuset_full_lock() called
> or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held.
> 
> As we are going to enable dynamic update to the nohz_full housekeeping
> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
> allowing the CPU hotplug path to call into housekeeping_update() directly
> from update_isolation_cpumasks() will likely cause deadlock. So we

Why do we need to call housekeeping_update() from hotplug? I would
expect it to be called only when cpuset control files are written, since
housekeeping cpumasks don't deal with online CPUs but with possible
CPUs.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs


* Re: [PATCH/for-next v4 2/4] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
  2026-02-06 22:28   ` Frederic Weisbecker
@ 2026-02-08  2:00     ` Waiman Long
  2026-02-10 15:46       ` Frederic Weisbecker
  0 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2026-02-08  2:00 UTC
  To: Frederic Weisbecker
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Thomas Gleixner, Shuah Khan, cgroups,
	linux-kernel, linux-kselftest

On 2/6/26 5:28 PM, Frederic Weisbecker wrote:
> On Fri, Feb 06, 2026 at 03:37:10PM -0500, Waiman Long wrote:
>> The update_isolation_cpumasks() function can be called either directly
>> from regular cpuset control file write with cpuset_full_lock() called
>> or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held.
>>
>> As we are going to enable dynamic update to the nohz_full housekeeping
>> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
>> allowing the CPU hotplug path to call into housekeeping_update() directly
>> from update_isolation_cpumasks() will likely cause deadlock. So we
> Why do we need to call housekeeping_update() from hotplug? I would
> expect it to be called only when cpuset control file are written since
> housekeeping cpumask don't deal with online CPUs but with possible
> CPUs.

It needs to call housekeeping_update() only in the special case where
there is only one active CPU in an isolated partition and that CPU goes
offline. In this case, the partition becomes disabled, which changes the
set of isolated CPUs. I know this special case shouldn't happen in the
real world, but I do have a test case for it.

Theoretically, we can add code to handle this special case by keeping
the offline isolated CPU in a special pool without changing isolated_cpus
and hence the HK_TYPE_DOMAIN cpumask. That way, we shouldn't need to
call housekeeping_update() from CPU hotplug. I will probably do that, as
CPU hotplug will be used when we make the HK_TYPE_KERNEL_NOISE cpumask
dynamic in the near future.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH/for-next v4 2/4] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
  2026-02-08  2:00     ` Waiman Long
@ 2026-02-10 15:46       ` Frederic Weisbecker
  2026-02-10 18:53         ` Waiman Long
  0 siblings, 1 reply; 22+ messages in thread
From: Frederic Weisbecker @ 2026-02-10 15:46 UTC (permalink / raw)
  To: Waiman Long
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Thomas Gleixner, Shuah Khan, cgroups,
	linux-kernel, linux-kselftest

On Sat, Feb 07, 2026 at 09:00:45PM -0500, Waiman Long wrote:
> On 2/6/26 5:28 PM, Frederic Weisbecker wrote:
> > On Fri, Feb 06, 2026 at 03:37:10PM -0500, Waiman Long wrote:
> > > The update_isolation_cpumasks() function can be called either directly
> > > from regular cpuset control file write with cpuset_full_lock() called
> > > or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held.
> > > 
> > > As we are going to enable dynamic update to the nozh_full housekeeping
> > > cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
> > > allowing the CPU hotplug path to call into housekeeping_update() directly
> > > from update_isolation_cpumasks() will likely cause deadlock. So we
> > Why do we need to call housekeeping_update() from hotplug? I would
> > expect it to be called only when cpuset control file are written since
> > housekeeping cpumask don't deal with online CPUs but with possible
> > CPUs.
> 
> It needs to call housekeeping_update() only in the special case where there
> is only one active CPU in an isolated partition and that CPU goes offline.
> In this case, the partition becomes disabled that causes change in the
> isolated CPUs. I know this special case shouldn't happen in real world, but
> I do have test case to test that.

But why is that needed? This isn't changing the mask of domain isolated CPUs.
Only their onlineness. I mean timers, workqueue, kthreads all have their
hotplug callbacks able to deal with that already.

> Theoretically, we can add code to handle this special case to keep this
> offline isolated CPU in a special pool without changing isolated_cpus and
> hence  HK_TYPE_DOMAIN cpumask. In this way, we shouldn't need to call
> housekeeping_update() from CPU hotplug. I will probably do that as CPU
> hotplug will be used when we make HK_TYPE_KERNEL_NOISE cpumask dynamic in
> the near future.

That doesn't look necessary.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH/for-next v4 2/4] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
  2026-02-10 15:46       ` Frederic Weisbecker
@ 2026-02-10 18:53         ` Waiman Long
  0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2026-02-10 18:53 UTC (permalink / raw)
  To: Frederic Weisbecker, Waiman Long
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Thomas Gleixner, Shuah Khan, cgroups,
	linux-kernel, linux-kselftest

On 2/10/26 10:46 AM, Frederic Weisbecker wrote:
> On Sat, Feb 07, 2026 at 09:00:45PM -0500, Waiman Long wrote:
>> On 2/6/26 5:28 PM, Frederic Weisbecker wrote:
>>> On Fri, Feb 06, 2026 at 03:37:10PM -0500, Waiman Long wrote:
>>>> The update_isolation_cpumasks() function can be called either directly
>>>> from regular cpuset control file write with cpuset_full_lock() called
>>>> or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held.
>>>>
>>>> As we are going to enable dynamic update to the nozh_full housekeeping
>>>> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
>>>> allowing the CPU hotplug path to call into housekeeping_update() directly
>>>> from update_isolation_cpumasks() will likely cause deadlock. So we
>>> Why do we need to call housekeeping_update() from hotplug? I would
>>> expect it to be called only when cpuset control file are written since
>>> housekeeping cpumask don't deal with online CPUs but with possible
>>> CPUs.
>> It needs to call housekeeping_update() only in the special case where there
>> is only one active CPU in an isolated partition and that CPU goes offline.
>> In this case, the partition becomes disabled that causes change in the
>> isolated CPUs. I know this special case shouldn't happen in real world, but
>> I do have test case to test that.
> But why is that needed? This isn't changing the mask of domain isolated CPUs.
> Only their onlineness. I mean timers, workqueue, kthreads all have their
> hotplug callbacks able to deal with that already.

The current behavior is to remove the CPUs from cpuset.cpus.isolated
when an isolated partition is invalidated. It doesn't currently
differentiate whether that comes from hotplug or from writing to the
cpuset control files. I am planning to handle hotplug differently so
that it won't need to change cpuset.cpus.isolated.
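
As an illustration only, the behavior described above can be modelled in
user space with plain bitmasks. This is a sketch, not the cpuset code:
struct partition, cpumask_t (a 64-bit stand-in for struct cpumask) and
cpu_goes_offline() are hypothetical names, and the model assumes that a
partition is invalidated exactly when none of its CPUs is online.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cpumask: one bit per CPU (0..63). */
typedef uint64_t cpumask_t;

/* Hypothetical model of an isolated partition. */
struct partition {
	cpumask_t cpus;		/* CPUs assigned to the partition */
	bool valid;		/* false once the partition is invalidated */
};

/*
 * Take @cpu offline. If that leaves the partition without any online CPU,
 * invalidate it and drop its CPUs from the isolated mask, regardless of
 * what triggered the invalidation.
 *
 * Returns true when the isolated mask changed, which is the case where a
 * housekeeping update would be needed.
 */
static bool cpu_goes_offline(struct partition *p, cpumask_t *online,
			     cpumask_t *isolated, int cpu)
{
	cpumask_t before = *isolated;

	*online &= ~((cpumask_t)1 << cpu);
	if (p->valid && !(p->cpus & *online)) {
		p->valid = false;
		*isolated &= ~p->cpus;
	}
	return *isolated != before;
}
```

In this model, offlining a CPU outside the partition leaves the isolated
mask alone, while offlining the last online CPU of the partition changes
it. The latter is the only hotplug case that reaches the housekeeping
update path.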

>
>> Theoretically, we can add code to handle this special case to keep this
>> offline isolated CPU in a special pool without changing isolated_cpus and
>> hence  HK_TYPE_DOMAIN cpumask. In this way, we shouldn't need to call
>> housekeeping_update() from CPU hotplug. I will probably do that as CPU
>> hotplug will be used when we make HK_TYPE_KERNEL_NOISE cpumask dynamic in
>> the near future.
> That doesn't look necessary.

Yes, I think we can use the existing infrastructure to handle it without 
the need to add a special pool.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH/for-next v4 2/4] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
  2026-02-06 20:37 ` [PATCH/for-next v4 2/4] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue Waiman Long
  2026-02-06 22:28   ` Frederic Weisbecker
@ 2026-02-09  6:57   ` Chen Ridong
  1 sibling, 0 replies; 22+ messages in thread
From: Chen Ridong @ 2026-02-09  6:57 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest



On 2026/2/7 4:37, Waiman Long wrote:
> The update_isolation_cpumasks() function can be called either directly
> from regular cpuset control file write with cpuset_full_lock() called
> or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held.
> 
> As we are going to enable dynamic update to the nozh_full housekeeping
						    ^
						nohz_full
> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
> allowing the CPU hotplug path to call into housekeeping_update() directly
> from update_isolation_cpumasks() will likely cause deadlock. So we
> have to defer any call to housekeeping_update() after the CPU hotplug
> operation has finished. This is now done via the workqueue where
> the actual housekeeping_update() call, if needed, will happen after
> cpus_write_lock is released.
> 
> We can't use the synchronous task_work API as call from CPU hotplug
> path happen in the per-cpu kthread of the CPU that is being shut down
> or brought up. Because of the asynchronous nature of workqueue, the
> HK_TYPE_DOMAIN housekeeping cpumask will be updated a bit later than the
> "cpuset.cpus.isolated" control file in this case.
> 
> Also add a check in test_cpuset_prs.sh and modify some existing
> test cases to confirm that "cpuset.cpus.isolated" and HK_TYPE_DOMAIN
> housekeeping cpumask will both be updated.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/cgroup/cpuset.c                        | 41 +++++++++++++++++--
>  .../selftests/cgroup/test_cpuset_prs.sh       | 13 ++++--
>  2 files changed, 48 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index a4c6386a594d..eb0eabd85e8c 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1302,6 +1302,17 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
>  	return false;
>  }
>  
> +static void isolcpus_workfn(struct work_struct *work)
> +{
> +	cpuset_full_lock();
> +	if (isolated_cpus_updating) {
> +		isolated_cpus_updating = false;
> +		WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
> +		rebuild_sched_domains_locked();
> +	}
> +	cpuset_full_unlock();
> +}
> +
>  /*
>   * update_isolation_cpumasks - Update external isolation related CPU masks
>   *
> @@ -1310,14 +1321,38 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
>   */
>  static void update_isolation_cpumasks(void)
>  {
> -	int ret;
> +	static DECLARE_WORK(isolcpus_work, isolcpus_workfn);
>  
> +	lockdep_assert_cpuset_lock_held();
>  	if (!isolated_cpus_updating)
>  		return;
>  
> -	ret = housekeeping_update(isolated_cpus);
> -	WARN_ON_ONCE(ret < 0);
> +	/*
> +	 * This function can be reached either directly from regular cpuset
> +	 * control file write or via CPU hotplug. In the latter case, it is
> +	 * the per-cpu kthread that calls cpuset_handle_hotplug() on behalf
> +	 * of the task that initiates CPU shutdown or bringup.
> +	 *
> +	 * To have better flexibility and prevent the possibility of deadlock
> +	 * when calling from CPU hotplug, we defer the housekeeping_update()
> +	 * call to after the current cpuset critical section has finished.
> +	 * This is done via workqueue.
> +	 */
> +	if (current->flags & PF_KTHREAD) {
> +		/*
> +		 * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work
> +		 * item that is still pending. Before the pending bit is
> +		 * cleared, the work data is copied out and work item dequeued.
> +		 * So it is possible to queue the work again before the
> +		 * isolcpus_workfn() is invoked to process the previously
> +		 * queued work. Since isolcpus_workfn() doesn't use the work
> +		 * item at all, this is not a problem.
> +		 */
> +		queue_work(system_unbound_wq, &isolcpus_work);
> +		return;
> +	}
>  
> +	WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
>  	isolated_cpus_updating = false;
>  }
>  
> diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> index 5dff3ad53867..0502b156582b 100755
> --- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> @@ -245,8 +245,9 @@ TEST_MATRIX=(
>  	"C2-3:P1:S+  C3:P2  .      .     O2=0   O2=1    .      .     0 A1:2|A2:3 A1:P1|A2:P2"
>  	"C2-3:P1:S+  C3:P1  .      .     O2=0    .      .      .     0 A1:|A2:3 A1:P1|A2:P1"
>  	"C2-3:P1:S+  C3:P1  .      .     O3=0    .      .      .     0 A1:2|A2: A1:P1|A2:P1"
> -	"C2-3:P1:S+  C3:P1  .      .    T:O2=0   .      .      .     0 A1:3|A2:3 A1:P1|A2:P-1"
> -	"C2-3:P1:S+  C3:P1  .      .      .    T:O3=0   .      .     0 A1:2|A2:2 A1:P1|A2:P-1"
> +	"C2-3:P1:S+  C3:P2  .      .    T:O2=0   .      .      .     0 A1:3|A2:3 A1:P1|A2:P-2"
> +	"C1-3:P1:S+  C3:P2  .      .      .    T:O3=0   .      .     0 A1:1-2|A2:1-2 A1:P1|A2:P-2 3|"
> +	"C1-3:P1:S+  C3:P2  .      .      .    T:O3=0  O3=1    .     0 A1:1-2|A2:3 A1:P1|A2:P2  3"
>  	"$SETUP_A123_PARTITIONS    .     O1=0    .      .      .     0 A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
>  	"$SETUP_A123_PARTITIONS    .     O2=0    .      .      .     0 A1:1|A2:|A3:3 A1:P1|A2:P1|A3:P1"
>  	"$SETUP_A123_PARTITIONS    .     O3=0    .      .      .     0 A1:1|A2:2|A3: A1:P1|A2:P1|A3:P1"
> @@ -764,7 +765,7 @@ check_cgroup_states()
>  # only CPUs in isolated partitions as well as those that are isolated at
>  # boot time.
>  #
> -# $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
> +# $1 - expected isolated cpu list(s) <isolcpus1>{|<isolcpus2>}
>  # <isolcpus1> - expected sched/domains value
>  # <isolcpus2> - cpuset.cpus.isolated value = <isolcpus1> if not defined
>  #
> @@ -773,6 +774,7 @@ check_isolcpus()
>  	EXPECTED_ISOLCPUS=$1
>  	ISCPUS=${CGROUP2}/cpuset.cpus.isolated
>  	ISOLCPUS=$(cat $ISCPUS)
> +	HKICPUS=$(cat /sys/devices/system/cpu/isolated)
>  	LASTISOLCPU=
>  	SCHED_DOMAINS=/sys/kernel/debug/sched/domains
>  	if [[ $EXPECTED_ISOLCPUS = . ]]
> @@ -810,6 +812,11 @@ check_isolcpus()
>  	ISOLCPUS=
>  	EXPECTED_ISOLCPUS=$EXPECTED_SDOMAIN
>  
> +	#
> +	# The inverse of HK_TYPE_DOMAIN cpumask in $HKICPUS should match $ISOLCPUS
> +	#
> +	[[ "$ISOLCPUS" != "$HKICPUS" ]] && return 1
> +
>  	#
>  	# Use the sched domain in debugfs to check isolated CPUs, if available
>  	#

-- 
Best regards,
Ridong


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH/for-next v4 3/4] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
  2026-02-06 20:37 [PATCH/for-next v4 0/4] cgroup/cpuset: Fix partition related locking issues Waiman Long
  2026-02-06 20:37 ` [PATCH/for-next v4 1/4] cgroup/cpuset: Clarify exclusion rules for cpuset internal variables Waiman Long
  2026-02-06 20:37 ` [PATCH/for-next v4 2/4] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue Waiman Long
@ 2026-02-06 20:37 ` Waiman Long
  2026-02-09  7:12   ` Chen Ridong
  2026-02-09  7:23   ` Chen Ridong
  2026-02-06 20:37 ` [PATCH/for-next v4 4/4] cgroup/cpuset: Eliminate some duplicated rebuild_sched_domains() calls Waiman Long
  3 siblings, 2 replies; 22+ messages in thread
From: Waiman Long @ 2026-02-06 20:37 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long

The current cpuset partition code is able to dynamically update
the sched domains of a running system and the corresponding
HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
"isolcpus=domain,..." boot command line feature at run time.

The housekeeping cpumask update requires flushing a number of different
workqueues, which may not be safe with cpus_read_lock() held, as the
workqueue flushing code may acquire cpus_read_lock() or acquire locks
that have a locking dependency on cpus_read_lock() down the chain. Below
is an example of such a circular locking problem.

  ======================================================
  WARNING: possible circular locking dependency detected
  6.18.0-test+ #2 Tainted: G S
  ------------------------------------------------------
  test_cpuset_prs/10971 is trying to acquire lock:
  ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180

  but task is already holding lock:
  ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:
  -> #4 (cpuset_mutex){+.+.}-{4:4}:
  -> #3 (cpu_hotplug_lock){++++}-{0:0}:
  -> #2 (rtnl_mutex){+.+.}-{4:4}:
  -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
  -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:

  Chain exists of:
    (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex

  5 locks held by test_cpuset_prs/10971:
   #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
   #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
   #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
   #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
   #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130

  Call Trace:
   <TASK>
     :
   touch_wq_lockdep_map+0x93/0x180
   __flush_workqueue+0x111/0x10b0
   housekeeping_update+0x12d/0x2d0
   update_parent_effective_cpumask+0x595/0x2440
   update_prstate+0x89d/0xce0
   cpuset_partition_write+0xc5/0x130
   cgroup_file_write+0x1a5/0x680
   kernfs_fop_write_iter+0x3df/0x5f0
   vfs_write+0x525/0xfd0
   ksys_write+0xf9/0x1d0
   do_syscall_64+0x95/0x520
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

To avoid such a circular locking dependency problem, we have to
call housekeeping_update() without holding the cpus_read_lock() and
cpuset_mutex. The current set of wq's flushed by housekeeping_update()
may not have work functions that call cpus_read_lock() directly,
but we are likely to extend the list of wq's that are flushed in the
future. Moreover, the current set of work functions may hold locks that
may have cpu_hotplug_lock down the dependency chain.
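
As an illustration only (this is not kernel code), the cycle in the report
above can be reproduced with a small user-space model of a lock dependency
graph. The enum, add_dependency(), record_existing_chain() and
would_deadlock() are hypothetical helpers. The edges are one reading of the
chain printed by lockdep, where an edge from A to B means that B was
acquired while A was held.

```c
#include <stdbool.h>

/* Lock classes named in the splat; the enum itself is illustrative. */
enum {
	SYNC_WQ,		/* (wq_completion)sync_wq */
	ARG_WORK,		/* (work_completion)(&arg.work) */
	RTNL_MUTEX,
	CPU_HOTPLUG_LOCK,
	CPUSET_MUTEX,
	NR_LOCK_CLASSES,
};

/* dep[a][b] is true when b has been acquired while a was held. */
static bool dep[NR_LOCK_CLASSES][NR_LOCK_CLASSES];

static void add_dependency(int held, int acquired)
{
	dep[held][acquired] = true;
}

/* Depth-first search: is @to reachable from @from along recorded edges? */
static bool reachable(int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < NR_LOCK_CLASSES; next++)
		if (dep[from][next] && !seen[next] && reachable(next, to, seen))
			return true;
	return false;
}

/*
 * Acquiring @acquired while holding @held closes a cycle if @held is
 * already reachable from @acquired in the existing graph.
 */
static bool would_deadlock(int held, int acquired)
{
	bool seen[NR_LOCK_CLASSES] = { false };

	return reachable(acquired, held, seen);
}

/* Existing chain from the report, entries #0 up to #4. */
static void record_existing_chain(void)
{
	add_dependency(SYNC_WQ, ARG_WORK);
	add_dependency(ARG_WORK, RTNL_MUTEX);
	add_dependency(RTNL_MUTEX, CPU_HOTPLUG_LOCK);
	add_dependency(CPU_HOTPLUG_LOCK, CPUSET_MUTEX);
}
```

With the chain recorded, would_deadlock(CPUSET_MUTEX, SYNC_WQ) and
would_deadlock(CPU_HOTPLUG_LOCK, SYNC_WQ) both report a cycle, which is why
the flush has to happen with neither lock held.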

One way to do that is to defer the housekeeping_update() call until after
the current cpuset critical section has finished, so that it runs without
holding cpus_read_lock. For cpuset control file writes, this can be done
by deferring it using task_work, which runs right before returning to
userspace.

To enable mutual exclusion between the housekeeping_update() call and
other cpuset control file write actions, a new top level cpuset_top_mutex
is introduced. This new mutex will be acquired first to allow sharing
variables used by both code paths. However, a cpuset update from CPU
hotplug can still happen in parallel with the housekeeping_update()
call, though that should be rare in a production environment.
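
As an illustration only, the deferral scheme can be sketched in user space
without any real locking. The names below (cpumask_t as a 64-bit stand-in
for struct cpumask, set_isolated(), run_deferred_update(),
fake_housekeeping_update() and hk_update_calls) are hypothetical and only
mirror the roles of isolated_cpus, isolated_hk_cpus and
isolcpus_twork_queued in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cpumask: one bit per CPU. */
typedef uint64_t cpumask_t;

static cpumask_t isolated_cpus;		/* changed inside the critical section */
static cpumask_t isolated_hk_cpus;	/* last mask given to housekeeping */
static bool twork_queued;		/* a deferred update is pending */
static int hk_update_calls;		/* number of simulated updates */

/* Stand-in for housekeeping_update(); it only counts invocations. */
static void fake_housekeeping_update(cpumask_t mask)
{
	(void)mask;
	hk_update_calls++;
}

/*
 * Runs inside the critical section: record the new mask and queue at most
 * one deferred update, however many times the mask changes.
 */
static void set_isolated(cpumask_t mask)
{
	isolated_cpus = mask;
	if (!twork_queued)
		twork_queued = true;
}

/*
 * Runs after the critical section has been left. The expensive update is
 * skipped when the mask already matches what housekeeping last saw.
 * Returns true when an update was performed.
 */
static bool run_deferred_update(void)
{
	twork_queued = false;
	if (isolated_hk_cpus == isolated_cpus)
		return false;
	isolated_hk_cpus = isolated_cpus;
	fake_housekeeping_update(isolated_hk_cpus);
	return true;
}
```

Two changes made in the same critical section collapse into a single
deferred update in this model, and a later run with an unchanged mask does
nothing.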

As cpus_read_lock() is now no longer held when
tmigr_isolated_exclude_cpumask() is called, it needs to acquire it
directly.

lockdep_is_cpuset_held() is also updated to return true if either
cpuset_top_mutex or cpuset_mutex is held.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c        | 107 +++++++++++++++++++++++++++-------
 kernel/sched/isolation.c      |   4 +-
 kernel/time/timer_migration.c |   4 +-
 3 files changed, 89 insertions(+), 26 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index eb0eabd85e8c..d26c77a726b2 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -65,14 +65,28 @@ static const char * const perr_strings[] = {
  * CPUSET Locking Convention
  * -------------------------
  *
- * Below are the three global locks guarding cpuset structures in lock
+ * Below are the four global/local locks guarding cpuset structures in lock
  * acquisition order:
+ *  - cpuset_top_mutex
  *  - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
  *  - cpuset_mutex
  *  - callback_lock (raw spinlock)
  *
- * A task must hold all the three locks to modify externally visible or
- * used fields of cpusets, though some of the internally used cpuset fields
+ * As cpuset will now indirectly flush a number of different workqueues in
+ * housekeeping_update() to update housekeeping cpumasks when the set of
+ * isolated CPUs is going to be changed, it may be vulnerable to deadlock
+ * if we hold cpus_read_lock while calling into housekeeping_update().
+ *
+ * The first cpuset_top_mutex will be held except when calling into
+ * cpuset_handle_hotplug() from the CPU hotplug code where cpus_write_lock
+ * and cpuset_mutex will be held instead. The main purpose of this mutex
+ * is to prevent regular cpuset control file write actions from interfering
+ * with the call to housekeeping_update(), though CPU hotplug operation can
+ * still happen in parallel. This mutex also provides protection for some
+ * internal variables.
+ *
+ * A task must hold all the remaining three locks to modify externally visible
+ * or used fields of cpusets, though some of the internally used cpuset fields
  * and internal variables can be modified without holding callback_lock. If only
  * reliable read access of the externally used fields are needed, a task can
  * hold either cpuset_mutex or callback_lock which are exposed to other
@@ -100,6 +114,7 @@ static const char * const perr_strings[] = {
  * cpumasks and nodemasks.
  */
 
+static DEFINE_MUTEX(cpuset_top_mutex);
 static DEFINE_MUTEX(cpuset_mutex);
 
 /*
@@ -111,6 +126,8 @@ static DEFINE_MUTEX(cpuset_mutex);
  *
  * CSCB: Readable by holding either cpuset_mutex or callback_lock. Writable
  *	 by holding both cpuset_mutex and callback_lock.
+ *
+ * T:	 Read/write-able by holding the cpuset_top_mutex.
  */
 
 /*
@@ -135,6 +152,13 @@ static cpumask_var_t	isolated_cpus;		/* CSCB */
  */
 static bool		isolated_cpus_updating;	/* RWCS */
 
+/*
+ * Copy of isolated_cpus to be processed by housekeeping_update()
+ */
+static cpumask_var_t	isolated_hk_cpus;	/* T */
+static bool		isolcpus_twork_queued;	/* T */
+
+
 /*
  * A flag to force sched domain rebuild at the end of an operation.
  * It can be set in
@@ -298,6 +322,7 @@ void lockdep_assert_cpuset_lock_held(void)
  */
 void cpuset_full_lock(void)
 {
+	mutex_lock(&cpuset_top_mutex);
 	cpus_read_lock();
 	mutex_lock(&cpuset_mutex);
 }
@@ -306,12 +331,14 @@ void cpuset_full_unlock(void)
 {
 	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
+	mutex_unlock(&cpuset_top_mutex);
 }
 
 #ifdef CONFIG_LOCKDEP
 bool lockdep_is_cpuset_held(void)
 {
-	return lockdep_is_held(&cpuset_mutex);
+	return lockdep_is_held(&cpuset_mutex) ||
+	       lockdep_is_held(&cpuset_top_mutex);
 }
 #endif
 
@@ -1302,30 +1329,53 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
 	return false;
 }
 
-static void isolcpus_workfn(struct work_struct *work)
+/*
+ * housekeeping_update() will only be called if isolated_cpus differs
+ * from isolated_hk_cpus. To be safe, rebuild_sched_domains() will always
+ * be called just in case there are still pending sched domains changes.
+ */
+static void do_housekeeping_update(bool *flag)
 {
-	cpuset_full_lock();
-	if (isolated_cpus_updating) {
-		isolated_cpus_updating = false;
-		WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
-		rebuild_sched_domains_locked();
+	bool update_hk = true;
+
+	guard(mutex)(&cpuset_top_mutex);
+	if (flag)
+		*flag = false;
+	scoped_guard(spinlock_irq, &callback_lock) {
+		if (cpumask_equal(isolated_hk_cpus, isolated_cpus))
+			update_hk = false;
+		else
+			cpumask_copy(isolated_hk_cpus, isolated_cpus);
 	}
-	cpuset_full_unlock();
+	if (update_hk)
+		WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus) < 0);
+	rebuild_sched_domains();
+}
+
+static void isolcpus_workfn(struct work_struct *work)
+{
+	do_housekeeping_update(NULL);
+}
+
+static void isolcpus_tworkfn(struct callback_head *cb)
+{
+	/* Clear isolcpus_twork_queued */
+	do_housekeeping_update(&isolcpus_twork_queued);
 }
 
 /*
  * update_isolation_cpumasks - Update external isolation related CPU masks
- *
- * The following external CPU masks will be updated if necessary:
- * - workqueue unbound cpumask
  */
 static void update_isolation_cpumasks(void)
 {
 	static DECLARE_WORK(isolcpus_work, isolcpus_workfn);
+	static struct callback_head twork_cb;
 
 	lockdep_assert_cpuset_lock_held();
 	if (!isolated_cpus_updating)
 		return;
+	else
+		isolated_cpus_updating = false;
 
 	/*
 	 * This function can be reached either directly from regular cpuset
@@ -1333,10 +1383,15 @@ static void update_isolation_cpumasks(void)
 	 * the per-cpu kthread that calls cpuset_handle_hotplug() on behalf
 	 * of the task that initiates CPU shutdown or bringup.
 	 *
-	 * To have better flexibility and prevent the possibility of deadlock
-	 * when calling from CPU hotplug, we defer the housekeeping_update()
-	 * call to after the current cpuset critical section has finished.
-	 * This is done via workqueue.
+	 * To have better flexibility and prevent the possibility of deadlock,
+	 * we defer the housekeeping_update() call to after the current
+	 * cpuset critical section has finished. This is done via task_work
+	 * for cpuset control file write and workqueue for CPU hotplug.
+	 *
+	 * When calling from CPU hotplug, cpuset_top_mutex is not held. So the
+	 * cpuset operation can run asynchronously with do_housekeeping_update().
+	 * This should not be a problem as another isolcpus_workfn() call will
+	 * be scheduled to make sure that housekeeping cpumasks will be updated.
 	 */
 	if (current->flags & PF_KTHREAD) {
 		/*
@@ -1352,8 +1407,19 @@ static void update_isolation_cpumasks(void)
 		return;
 	}
 
-	WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
-	isolated_cpus_updating = false;
+	/*
+	 * update_isolation_cpumasks() may be called more than once in the
+	 * same cpuset_mutex critical section.
+	 */
+	lockdep_assert_held(&cpuset_top_mutex);
+	if (isolcpus_twork_queued)
+		return;
+
+	init_task_work(&twork_cb, isolcpus_tworkfn);
+	if (!task_work_add(current, &twork_cb, TWA_RESUME))
+		isolcpus_twork_queued = true;
+	else
+		WARN_ON_ONCE(1);	/* Current task shouldn't be exiting */
 }
 
 /**
@@ -3661,6 +3727,7 @@ int __init cpuset_init(void)
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL));
+	BUG_ON(!zalloc_cpumask_var(&isolated_hk_cpus, GFP_KERNEL));
 
 	cpumask_setall(top_cpuset.cpus_allowed);
 	nodes_setall(top_cpuset.mems_allowed);
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 3b725d39c06e..ef152d401fe2 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -123,8 +123,6 @@ int housekeeping_update(struct cpumask *isol_mask)
 	struct cpumask *trial, *old = NULL;
 	int err;
 
-	lockdep_assert_cpus_held();
-
 	trial = kmalloc(cpumask_size(), GFP_KERNEL);
 	if (!trial)
 		return -ENOMEM;
@@ -136,7 +134,7 @@ int housekeeping_update(struct cpumask *isol_mask)
 	}
 
 	if (!housekeeping.flags)
-		static_branch_enable_cpuslocked(&housekeeping_overridden);
+		static_branch_enable(&housekeeping_overridden);
 
 	if (housekeeping.flags & HK_FLAG_DOMAIN)
 		old = housekeeping_cpumask_dereference(HK_TYPE_DOMAIN);
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 6da9cd562b20..83428aa03aef 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1559,8 +1559,6 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
 	cpumask_var_t cpumask __free(free_cpumask_var) = CPUMASK_VAR_NULL;
 	int cpu;
 
-	lockdep_assert_cpus_held();
-
 	if (!works)
 		return -ENOMEM;
 	if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
@@ -1570,6 +1568,7 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
 	 * First set previously isolated CPUs as available (unisolate).
 	 * This cpumask contains only CPUs that switched to available now.
 	 */
+	guard(cpus_read_lock)();
 	cpumask_andnot(cpumask, cpu_online_mask, exclude_cpumask);
 	cpumask_andnot(cpumask, cpumask, tmigr_available_cpumask);
 
@@ -1626,7 +1625,6 @@ static int __init tmigr_init_isolation(void)
 	cpumask_andnot(cpumask, cpu_possible_mask, housekeeping_cpumask(HK_TYPE_DOMAIN));
 
 	/* Protect against RCU torture hotplug testing */
-	guard(cpus_read_lock)();
 	return tmigr_isolated_exclude_cpumask(cpumask);
 }
 late_initcall(tmigr_init_isolation);
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH/for-next v4 3/4] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
  2026-02-06 20:37 ` [PATCH/for-next v4 3/4] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock Waiman Long
@ 2026-02-09  7:12   ` Chen Ridong
  2026-02-09 20:29     ` Waiman Long
  2026-02-09  7:23   ` Chen Ridong
  1 sibling, 1 reply; 22+ messages in thread
From: Chen Ridong @ 2026-02-09  7:12 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest



On 2026/2/7 4:37, Waiman Long wrote:
> The current cpuset partition code is able to dynamically update
> the sched domains of a running system and the corresponding
> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
> "isolcpus=domain,..." boot command line feature at run time.
> 
> The housekeeping cpumask update requires flushing a number of different
> workqueues which may not be safe with cpus_read_lock() held as the
> workqueue flushing code may acquire cpus_read_lock() or acquiring locks
> which have locking dependency with cpus_read_lock() down the chain. Below
> is an example of such circular locking problem.
> 
>   ======================================================
>   WARNING: possible circular locking dependency detected
>   6.18.0-test+ #2 Tainted: G S
>   ------------------------------------------------------
>   test_cpuset_prs/10971 is trying to acquire lock:
>   ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180
> 
>   but task is already holding lock:
>   ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
> 
>   which lock already depends on the new lock.
> 
>   the existing dependency chain (in reverse order) is:
>   -> #4 (cpuset_mutex){+.+.}-{4:4}:
>   -> #3 (cpu_hotplug_lock){++++}-{0:0}:
>   -> #2 (rtnl_mutex){+.+.}-{4:4}:
>   -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
>   -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
> 
>   Chain exists of:
>     (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex
> 
>   5 locks held by test_cpuset_prs/10971:
>    #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
>    #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
>    #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
>    #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
>    #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
> 
>   Call Trace:
>    <TASK>
>      :
>    touch_wq_lockdep_map+0x93/0x180
>    __flush_workqueue+0x111/0x10b0
>    housekeeping_update+0x12d/0x2d0
>    update_parent_effective_cpumask+0x595/0x2440
>    update_prstate+0x89d/0xce0
>    cpuset_partition_write+0xc5/0x130
>    cgroup_file_write+0x1a5/0x680
>    kernfs_fop_write_iter+0x3df/0x5f0
>    vfs_write+0x525/0xfd0
>    ksys_write+0xf9/0x1d0
>    do_syscall_64+0x95/0x520
>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> To avoid such a circular locking dependency problem, we have to
> call housekeeping_update() without holding the cpus_read_lock() and
> cpuset_mutex. The current set of wq's flushed by housekeeping_update()
> may not have work functions that call cpus_read_lock() directly,
> but we are likely to extend the list of wq's that are flushed in the
> future. Moreover, the current set of work functions may hold locks that
> may have cpu_hotplug_lock down the dependency chain.
> 
> One way to do that is to defer the housekeeping_update() call after
> the current cpuset critical section has finished without holding
> cpus_read_lock. For cpuset control file write, this can be done by
> deferring it using task_work right before returning to userspace.
> 
> To enable mutual exclusion between the housekeeping_update() call and
> other cpuset control file write actions, a new top level cpuset_top_mutex
> is introduced. This new mutex will be acquired first to allow sharing
> variables used by both code paths. However, cpuset update from CPU
> hotplug can still happen in parallel with the housekeeping_update()
> call, though that should be rare in production environment.
> 
> As cpus_read_lock() is now no longer held when
> tmigr_isolated_exclude_cpumask() is called, it needs to acquire it
> directly.
> 
> The lockdep_is_cpuset_held() is also updated to return true if either
> cpuset_top_mutex or cpuset_mutex is held.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/cgroup/cpuset.c        | 107 +++++++++++++++++++++++++++-------
>  kernel/sched/isolation.c      |   4 +-
>  kernel/time/timer_migration.c |   4 +-
>  3 files changed, 89 insertions(+), 26 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index eb0eabd85e8c..d26c77a726b2 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -65,14 +65,28 @@ static const char * const perr_strings[] = {
>   * CPUSET Locking Convention
>   * -------------------------
>   *
> - * Below are the three global locks guarding cpuset structures in lock
> + * Below are the four global/local locks guarding cpuset structures in lock
>   * acquisition order:
> + *  - cpuset_top_mutex
>   *  - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
>   *  - cpuset_mutex
>   *  - callback_lock (raw spinlock)
>   *
> - * A task must hold all the three locks to modify externally visible or
> - * used fields of cpusets, though some of the internally used cpuset fields
> + * As cpuset will now indirectly flush a number of different workqueues in
> + * housekeeping_update() to update housekeeping cpumasks when the set of
> + * isolated CPUs is going to be changed, it may be vulnerable to deadlock
> + * if we hold cpus_read_lock while calling into housekeeping_update().
> + *
> + * The first cpuset_top_mutex will be held except when calling into
> + * cpuset_handle_hotplug() from the CPU hotplug code where cpus_write_lock
> + * and cpuset_mutex will be held instead. The main purpose of this mutex
> + * is to prevent regular cpuset control file write actions from interfering
> + * with the call to housekeeping_update(), though CPU hotplug operation can
> + * still happen in parallel. This mutex also provides protection for some
> + * internal variables.
> + *
> + * A task must hold all the remaining three locks to modify externally visible
> + * or used fields of cpusets, though some of the internally used cpuset fields
>   * and internal variables can be modified without holding callback_lock. If only
>   * reliable read access of the externally used fields are needed, a task can
>   * hold either cpuset_mutex or callback_lock which are exposed to other
> @@ -100,6 +114,7 @@ static const char * const perr_strings[] = {
>   * cpumasks and nodemasks.
>   */
>  
> +static DEFINE_MUTEX(cpuset_top_mutex);
>  static DEFINE_MUTEX(cpuset_mutex);
>  
>  /*
> @@ -111,6 +126,8 @@ static DEFINE_MUTEX(cpuset_mutex);
>   *
>   * CSCB: Readable by holding either cpuset_mutex or callback_lock. Writable
>   *	 by holding both cpuset_mutex and callback_lock.
> + *
> + * T:	 Read/write-able by holding the cpuset_top_mutex.
>   */
>  
>  /*
> @@ -135,6 +152,13 @@ static cpumask_var_t	isolated_cpus;		/* CSCB */
>   */
>  static bool		isolated_cpus_updating;	/* RWCS */
>  
> +/*
> + * Copy of isolated_cpus to be processed by housekeeping_update()
> + */
> +static cpumask_var_t	isolated_hk_cpus;	/* T */
> +static bool		isolcpus_twork_queued;	/* T */
> +
> +
>  /*
>   * A flag to force sched domain rebuild at the end of an operation.
>   * It can be set in
> @@ -298,6 +322,7 @@ void lockdep_assert_cpuset_lock_held(void)
>   */
>  void cpuset_full_lock(void)
>  {
> +	mutex_lock(&cpuset_top_mutex);
>  	cpus_read_lock();
>  	mutex_lock(&cpuset_mutex);
>  }
> @@ -306,12 +331,14 @@ void cpuset_full_unlock(void)
>  {
>  	mutex_unlock(&cpuset_mutex);
>  	cpus_read_unlock();
> +	mutex_unlock(&cpuset_top_mutex);
>  }
>  
>  #ifdef CONFIG_LOCKDEP
>  bool lockdep_is_cpuset_held(void)
>  {
> -	return lockdep_is_held(&cpuset_mutex);
> +	return lockdep_is_held(&cpuset_mutex) ||
> +	       lockdep_is_held(&cpuset_top_mutex);
>  }
>  #endif
>  
> @@ -1302,30 +1329,53 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
>  	return false;
>  }
>  
> -static void isolcpus_workfn(struct work_struct *work)
> +/*
> + * housekeeping_update() will only be called if isolated_cpus differs
> + * from isolated_hk_cpus. To be safe, rebuild_sched_domains() will always
> + * be called just in case there are still pending sched domains changes.
> + */
> +static void do_housekeeping_update(bool *flag)
>  {
> -	cpuset_full_lock();
> -	if (isolated_cpus_updating) {
> -		isolated_cpus_updating = false;
> -		WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
> -		rebuild_sched_domains_locked();
> +	bool update_hk = true;
> +
> +	guard(mutex)(&cpuset_top_mutex);
> +	if (flag)
> +		*flag = false;
> +	scoped_guard(spinlock_irq, &callback_lock) {
> +		if (cpumask_equal(isolated_hk_cpus, isolated_cpus))
> +			update_hk = false;
> +		else
> +			cpumask_copy(isolated_hk_cpus, isolated_cpus);
>  	}
> -	cpuset_full_unlock();
> +	if (update_hk)
> +		WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus) < 0);
> +	rebuild_sched_domains();
> +}
> +
> +static void isolcpus_workfn(struct work_struct *work)
> +{
> +	do_housekeeping_update(NULL);
> +}
> +
> +static void isolcpus_tworkfn(struct callback_head *cb)
> +{
> +	/* Clear isolcpus_twork_queued */
> +	do_housekeeping_update(&isolcpus_twork_queued);
>  }
>  
>  /*
>   * update_isolation_cpumasks - Update external isolation related CPU masks
> - *
> - * The following external CPU masks will be updated if necessary:
> - * - workqueue unbound cpumask
>   */
>  static void update_isolation_cpumasks(void)
>  {
>  	static DECLARE_WORK(isolcpus_work, isolcpus_workfn);
> +	static struct callback_head twork_cb;
>  
>  	lockdep_assert_cpuset_lock_held();
>  	if (!isolated_cpus_updating)
>  		return;
> +	else
> +		isolated_cpus_updating = false;
>  
>  	/*
>  	 * This function can be reached either directly from regular cpuset
> @@ -1333,10 +1383,15 @@ static void update_isolation_cpumasks(void)
>  	 * the per-cpu kthread that calls cpuset_handle_hotplug() on behalf
>  	 * of the task that initiates CPU shutdown or bringup.
>  	 *
> -	 * To have better flexibility and prevent the possibility of deadlock
> -	 * when calling from CPU hotplug, we defer the housekeeping_update()
> -	 * call to after the current cpuset critical section has finished.
> -	 * This is done via workqueue.
> +	 * To have better flexibility and prevent the possibility of deadlock,
> +	 * we defer the housekeeping_update() call to after the current
> +	 * cpuset critical section has finished. This is done via task_work
> +	 * for cpuset control file write and workqueue for CPU hotplug.
> +	 *
> +	 * When calling from CPU hotplug, cpuset_top_mutex is not held. So the
> +	 * cpuset operation can run asynchronously with do_housekeeping_update().
> +	 * This should not be a problem as another isolcpus_workfn() call will
> +	 * be scheduled to make sure that housekeeping cpumasks will be updated.
>  	 */
>  	if (current->flags & PF_KTHREAD) {
>  		/*
> @@ -1352,8 +1407,19 @@ static void update_isolation_cpumasks(void)
>  		return;
>  	}
>  
> -	WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
> -	isolated_cpus_updating = false;
> +	/*
> +	 * update_isolation_cpumasks() may be called more than once in the
> +	 * same cpuset_mutex critical section.
> +	 */
> +	lockdep_assert_held(&cpuset_top_mutex);
> +	if (isolcpus_twork_queued)
> +		return;
> +
> +	init_task_work(&twork_cb, isolcpus_tworkfn);
> +	if (!task_work_add(current, &twork_cb, TWA_RESUME))
> +		isolcpus_twork_queued = true;
> +	else
> +		WARN_ON_ONCE(1);	/* Current task shouldn't be exiting */
>  }
>  

Timeline:

user A			user B
write isolated cpus	write isolated cpus
isolated_cpus_update
update_isolation_cpumasks
task_work_add
isolcpus_twork_queued =true

// before returning userspace
// waiting for worker
			isolated_cpus_update
			if (isolcpus_twork_queued)
				return // Early exit
			// return to userspace

// workqueue finishes
// return to userspace

For User B, the isolated_cpus value appears to be set and the syscall returns
successfully to userspace. However, because isolcpus_twork_queued was already
true (set by User A), User B's call skipped the actual mask update
(update_isolation_cpumasks).
Thus, the new isolated_cpus value is not yet effective in the kernel, even
though User B's write operation returned without error.

Is this a valid issue? Should User B's write be blocked?

>  /**
> @@ -3661,6 +3727,7 @@ int __init cpuset_init(void)
>  	BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL));
>  	BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL));
>  	BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL));
> +	BUG_ON(!zalloc_cpumask_var(&isolated_hk_cpus, GFP_KERNEL));
>  
>  	cpumask_setall(top_cpuset.cpus_allowed);
>  	nodes_setall(top_cpuset.mems_allowed);
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 3b725d39c06e..ef152d401fe2 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -123,8 +123,6 @@ int housekeeping_update(struct cpumask *isol_mask)
>  	struct cpumask *trial, *old = NULL;
>  	int err;
>  
> -	lockdep_assert_cpus_held();
> -
>  	trial = kmalloc(cpumask_size(), GFP_KERNEL);
>  	if (!trial)
>  		return -ENOMEM;
> @@ -136,7 +134,7 @@ int housekeeping_update(struct cpumask *isol_mask)
>  	}
>  
>  	if (!housekeeping.flags)
> -		static_branch_enable_cpuslocked(&housekeeping_overridden);
> +		static_branch_enable(&housekeeping_overridden);
>  
>  	if (housekeeping.flags & HK_FLAG_DOMAIN)
>  		old = housekeeping_cpumask_dereference(HK_TYPE_DOMAIN);
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> index 6da9cd562b20..83428aa03aef 100644
> --- a/kernel/time/timer_migration.c
> +++ b/kernel/time/timer_migration.c
> @@ -1559,8 +1559,6 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
>  	cpumask_var_t cpumask __free(free_cpumask_var) = CPUMASK_VAR_NULL;
>  	int cpu;
>  
> -	lockdep_assert_cpus_held();
> -
>  	if (!works)
>  		return -ENOMEM;
>  	if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
> @@ -1570,6 +1568,7 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
>  	 * First set previously isolated CPUs as available (unisolate).
>  	 * This cpumask contains only CPUs that switched to available now.
>  	 */
> +	guard(cpus_read_lock)();
>  	cpumask_andnot(cpumask, cpu_online_mask, exclude_cpumask);
>  	cpumask_andnot(cpumask, cpumask, tmigr_available_cpumask);
>  
> @@ -1626,7 +1625,6 @@ static int __init tmigr_init_isolation(void)
>  	cpumask_andnot(cpumask, cpu_possible_mask, housekeeping_cpumask(HK_TYPE_DOMAIN));
>  
>  	/* Protect against RCU torture hotplug testing */
> -	guard(cpus_read_lock)();
>  	return tmigr_isolated_exclude_cpumask(cpumask);
>  }
>  late_initcall(tmigr_init_isolation);

-- 
Best regards,
Ridong


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH/for-next v4 3/4] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
  2026-02-09  7:12   ` Chen Ridong
@ 2026-02-09 20:29     ` Waiman Long
  2026-02-10  1:29       ` Chen Ridong
  0 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2026-02-09 20:29 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest

On 2/9/26 2:12 AM, Chen Ridong wrote:
>>   		return;
>>   	}
>>   
>> -	WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
>> -	isolated_cpus_updating = false;
>> +	/*
>> +	 * update_isolation_cpumasks() may be called more than once in the
>> +	 * same cpuset_mutex critical section.
>> +	 */
>> +	lockdep_assert_held(&cpuset_top_mutex);
>> +	if (isolcpus_twork_queued)
>> +		return;
>> +
>> +	init_task_work(&twork_cb, isolcpus_tworkfn);
>> +	if (!task_work_add(current, &twork_cb, TWA_RESUME))
>> +		isolcpus_twork_queued = true;
>> +	else
>> +		WARN_ON_ONCE(1);	/* Current task shouldn't be exiting */
>>   }
>>   
> Timeline:
>
> user A			user B
> write isolated cpus	write isolated cpus
> isolated_cpus_update
> update_isolation_cpumasks
> task_work_add
> isolcpus_twork_queued =true
>
> // before returning userspace
> // waiting for worker
> 			isolated_cpus_update
> 			if (isolcpus_twork_queued)
> 				return // Early exit
> 			// return to userspace
>
> // workqueue finishes
> // return to userspace
>
> For User B, the isolated_cpus value appears to be set and the syscall returns
> successfully to userspace. However, because isolcpus_twork_queued was already
> true (set by User A), User B's call skipped the actual mask update
> (update_isolation_cpumasks).
> Thus, the new isolated_cpus value is not yet effective in the kernel, even
> though User B's write operation returned without error.
>
> Is this a valid issue? Should User B's write be blocked?

It is perfectly possible for isolated_cpus to be modified more than
once by different tasks before a work or task_work function is
executed. When that function is invoked, isolated_cpus should contain
the changes from both. It will copy isolated_cpus to isolated_hk_cpus and
pass it to housekeeping_update(). When the 2nd work or task_work
function is invoked, it will see that isolated_cpus matches
isolated_hk_cpus and skip the housekeeping_update() action. There is no
need to block user B's write as only one task can update isolated_cpus
at any time.
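
The coalescing described above can be illustrated with a minimal
userspace sketch. This is not kernel code: cpumasks are modeled as
64-bit integers, locking is omitted, and every model_* name is
hypothetical and exists only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for the cpumasks, one bit per CPU (assumption: <= 64 CPUs). */
uint64_t model_isolated_cpus;     /* changed by every writer */
uint64_t model_isolated_hk_cpus;  /* last value handed to the update step */
int model_hk_update_calls;        /* counts modeled housekeeping_update() calls */

/* Hypothetical writer: adds CPUs to the isolated set. */
void model_write_isolated(uint64_t mask)
{
	model_isolated_cpus |= mask;
}

/*
 * Model of the deferred function: take a snapshot and apply it only
 * when it differs from what was applied last time.
 * Returns true when an update was applied.
 */
bool model_do_housekeeping_update(void)
{
	if (model_isolated_hk_cpus == model_isolated_cpus)
		return false;
	model_isolated_hk_cpus = model_isolated_cpus;
	model_hk_update_calls++;
	return true;
}

/* Two writers change the mask before either deferred function runs. */
void model_coalescing_demo(void)
{
	model_write_isolated(0x3);	/* user A isolates CPUs 0-1 */
	model_write_isolated(0xc);	/* user B isolates CPUs 2-3 */

	/* First deferred call picks up both changes at once. */
	assert(model_do_housekeeping_update());
	assert(model_isolated_hk_cpus == 0xf);

	/* Second deferred call finds nothing left to do. */
	assert(!model_do_housekeeping_update());
}
```

In this model the second call is a no-op because the snapshot already
matches, which is the same reason the second work or task_work
function skips housekeeping_update().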

Cheers,
Longman


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH/for-next v4 3/4] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
  2026-02-09 20:29     ` Waiman Long
@ 2026-02-10  1:29       ` Chen Ridong
  2026-02-10 14:01         ` Waiman Long
  0 siblings, 1 reply; 22+ messages in thread
From: Chen Ridong @ 2026-02-10  1:29 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest



On 2026/2/10 4:29, Waiman Long wrote:
> On 2/9/26 2:12 AM, Chen Ridong wrote:
>>>           return;
>>>       }
>>>   -    WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
>>> -    isolated_cpus_updating = false;
>>> +    /*
>>> +     * update_isolation_cpumasks() may be called more than once in the
>>> +     * same cpuset_mutex critical section.
>>> +     */
>>> +    lockdep_assert_held(&cpuset_top_mutex);
>>> +    if (isolcpus_twork_queued)
>>> +        return;
>>> +
>>> +    init_task_work(&twork_cb, isolcpus_tworkfn);
>>> +    if (!task_work_add(current, &twork_cb, TWA_RESUME))
>>> +        isolcpus_twork_queued = true;
>>> +    else
>>> +        WARN_ON_ONCE(1);    /* Current task shouldn't be exiting */
>>>   }
>>>   
>> Timeline:
>>
>> user A            user B
>> write isolated cpus    write isolated cpus
>> isolated_cpus_update
>> update_isolation_cpumasks
>> task_work_add
>> isolcpus_twork_queued =true
>>
>> // before returning userspace
>> // waiting for worker
>>             isolated_cpus_update
>>             if (isolcpus_twork_queued)
>>                 return // Early exit
>>             // return to userspace
>>
>> // workqueue finishes
>> // return to userspace
>>
>> For User B, the isolated_cpus value appears to be set and the syscall returns
>> successfully to userspace. However, because isolcpus_twork_queued was already
>> true (set by User A), User B's call skipped the actual mask update
>> (update_isolation_cpumasks).
>> Thus, the new isolated_cpus value is not yet effective in the kernel, even
>> though User B's write operation returned without error.
>>
>> Is this a valid issue? Should User B's write be blocked?
> 
> It is perfectly possible that isolated_cpus can be modified more than one time
> from different tasks before a work or task_work function is executed. When that
> function is invoked, isolated_cpus should contain changes for both. It will copy
> isolated_cpus to isolated_hk_cpus and pass it to housekeeping_update(). When the

The relationship between isolated_hk_cpus and isolated_cpus is clear.

> 2nd work or task_work function is invoked, it will see that isolated_cpus match
> isolated_hk_cpus and skip the housekeeping_update() action. There is no need to
> block user B's write as only one task can update isolated_cpus at any time.
> 

The main question remains: user B receives a success return even though
isolated_hk_cpus has not yet taken effect (i.e.,
/sys/devices/system/cpu/isolated does not reflect the change). In that case, how
can user B confirm whether their configuration is actually applied?

-- 
Best regards,
Ridong


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH/for-next v4 3/4] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
  2026-02-10  1:29       ` Chen Ridong
@ 2026-02-10 14:01         ` Waiman Long
  0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2026-02-10 14:01 UTC (permalink / raw)
  To: Chen Ridong, Waiman Long, Tejun Heo, Johannes Weiner,
	Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker,
	Thomas Gleixner, Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest

On 2/9/26 8:29 PM, Chen Ridong wrote:
>
> On 2026/2/10 4:29, Waiman Long wrote:
>> On 2/9/26 2:12 AM, Chen Ridong wrote:
>>>>            return;
>>>>        }
>>>>    -    WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
>>>> -    isolated_cpus_updating = false;
>>>> +    /*
>>>> +     * update_isolation_cpumasks() may be called more than once in the
>>>> +     * same cpuset_mutex critical section.
>>>> +     */
>>>> +    lockdep_assert_held(&cpuset_top_mutex);
>>>> +    if (isolcpus_twork_queued)
>>>> +        return;
>>>> +
>>>> +    init_task_work(&twork_cb, isolcpus_tworkfn);
>>>> +    if (!task_work_add(current, &twork_cb, TWA_RESUME))
>>>> +        isolcpus_twork_queued = true;
>>>> +    else
>>>> +        WARN_ON_ONCE(1);    /* Current task shouldn't be exiting */
>>>>    }
>>>>    
>>> Timeline:
>>>
>>> user A            user B
>>> write isolated cpus    write isolated cpus
>>> isolated_cpus_update
>>> update_isolation_cpumasks
>>> task_work_add
>>> isolcpus_twork_queued =true
>>>
>>> // before returning userspace
>>> // waiting for worker
>>>              isolated_cpus_update
>>>              if (isolcpus_twork_queued)
>>>                  return // Early exit
>>>              // return to userspace
>>>
>>> // workqueue finishes
>>> // return to userspace
>>>
>>> For User B, the isolated_cpus value appears to be set and the syscall returns
>>> successfully to userspace. However, because isolcpus_twork_queued was already
>>> true (set by User A), User B's call skipped the actual mask update
>>> (update_isolation_cpumasks).
>>> Thus, the new isolated_cpus value is not yet effective in the kernel, even
>>> though User B's write operation returned without error.
>>>
>>> Is this a valid issue? Should User B's write be blocked?
>> It is perfectly possible that isolated_cpus can be modified more than one time
>> from different tasks before a work or task_work function is executed. When that
>> function is invoked, isolated_cpus should contain changes for both. It will copy
>> isolated_cpus to isolated_hk_cpus and pass it to housekeeping_update(). When the
> It is clear about isolated_hk_cpus and isolated_cpus.
>
>> 2nd work or task_work function is invoked, it will see that isolated_cpus match
>> isolated_hk_cpus and skip the housekeeping_update() action. There is no need to
>> block user B's write as only one task can update isolated_cpus at any time.
>>
> The main question remains: user B receives a success return even though
> isolated_hk_cpus has not yet taken effect (i.e.,
> /sys/devices/system/cpu/isolated does not reflect the change). In that case, how
> can user B confirm whether their configuration is actually applied?

The task_work function is synchronous. IOW, if a user writes to a cpuset
control file to modify an isolated partition, then by the time control is
passed back to userspace, the task_work function, if queued, is
guaranteed to have been executed.

The wq work function, OTOH, is asynchronous. So if a user brings down an
isolated CPU and thereby makes an isolated partition invalid, the
resulting changes to the sched domains may not be complete by the time
the offline operation returns. However, this is an operation that normal
users shouldn't do in a production system anyway, and they do so at
their own risk.
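
The difference between the two deferral paths can be shown with a small
userspace model. This is only a sketch under stated assumptions: there
is no real task_work or workqueue here, each pending callback is a
single function pointer, and all model_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*model_cb_t)(void);

static model_cb_t pending_task_work;	/* runs on the exit-to-userspace path */
static model_cb_t pending_wq_work;	/* runs whenever a worker gets to it */
int model_updates_applied;		/* how many deferred updates have run */

static void model_update(void)
{
	model_updates_applied++;
}

/* Run and clear a pending callback, if there is one. */
static void model_run(model_cb_t *slot)
{
	model_cb_t cb = *slot;

	*slot = NULL;
	if (cb)
		cb();
}

/* Control file write: the callback runs before the writer resumes. */
int model_cpuset_write(void)
{
	pending_task_work = model_update;	/* stands in for task_work_add() */
	model_run(&pending_task_work);		/* stands in for the return path */
	return model_updates_applied;		/* what the writer can observe */
}

/* CPU offline: the work is only queued, so the caller may return first. */
int model_cpu_offline(void)
{
	pending_wq_work = model_update;		/* stands in for queueing work */
	return model_updates_applied;		/* worker has not run yet */
}

/* The worker eventually runs the queued work. */
void model_worker_run(void)
{
	model_run(&pending_wq_work);
}

void model_deferral_demo(void)
{
	assert(model_cpuset_write() == 1);	/* applied before return */
	assert(model_cpu_offline() == 1);	/* still pending at return */
	model_worker_run();
	assert(model_updates_applied == 2);	/* applied only later */
}
```

In the model, the write path observes its own update on return, while
the offline path returns with the update still pending until the worker
runs.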

Cheers,
Longman


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH/for-next v4 3/4] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
  2026-02-06 20:37 ` [PATCH/for-next v4 3/4] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock Waiman Long
  2026-02-09  7:12   ` Chen Ridong
@ 2026-02-09  7:23   ` Chen Ridong
  2026-02-09 20:20     ` Waiman Long
  1 sibling, 1 reply; 22+ messages in thread
From: Chen Ridong @ 2026-02-09  7:23 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest



On 2026/2/7 4:37, Waiman Long wrote:
> +static cpumask_var_t	isolated_hk_cpus;	/* T */

Can we get this from isolation.c instead?

The name probably shouldn't include 'hk', since it refers to the inverse
(housekeeping CPUs) of isolated CPUs, right?

-- 
Best regards,
Ridong


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH/for-next v4 3/4] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
  2026-02-09  7:23   ` Chen Ridong
@ 2026-02-09 20:20     ` Waiman Long
  2026-02-10  1:39       ` Chen Ridong
  0 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2026-02-09 20:20 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest

On 2/9/26 2:23 AM, Chen Ridong wrote:
>
> On 2026/2/7 4:37, Waiman Long wrote:
>> +static cpumask_var_t	isolated_hk_cpus;	/* T */
> Can we get this from isolation.c instead?
>
> The name probably shouldn't include 'hk', since it refers to the inverse
> (housekeeping CPUs) of isolated CPUs, right?

housekeeping_update() will create an inverse of the passed-in isolated
cpumask. As for the name, I added hk to indicate that this cpumask is for
passing to housekeeping_update() to update the housekeeping cpumask. It
is not directly related to the cpumasks in sched/isolation.c. Please let
me know if you have a suggestion for the name.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH/for-next v4 3/4] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
  2026-02-09 20:20     ` Waiman Long
@ 2026-02-10  1:39       ` Chen Ridong
  2026-02-10 14:39         ` Waiman Long
  0 siblings, 1 reply; 22+ messages in thread
From: Chen Ridong @ 2026-02-10  1:39 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest



On 2026/2/10 4:20, Waiman Long wrote:
> On 2/9/26 2:23 AM, Chen Ridong wrote:
>>
>> On 2026/2/7 4:37, Waiman Long wrote:
>>> +static cpumask_var_t    isolated_hk_cpus;    /* T */
>> Can we get this from isolation.c instead?
>>
>> The name probably shouldn't include 'hk', since it refers to the inverse
>> (housekeeping CPUs) of isolated CPUs, right?
> 
> The housekeeping_update() will create an inverse of the pass-in isolated
> cpumasks. As for the name, I add hk to indicate this cpumask is for passing to
> housekeeping_update() to update housekeeping cpumask. It is not directly related
> to the cpumasks in sched/isolation.c. Please let me know if you have  a
> suggestion for the name.
> 

I understand the intent. However, when reading both cpuset.c and
sched/isolation.c, it can be confusing whether isolated_hk_cpus is an inverse
mask, since in sched/isolation.c “hk” consistently refers to the inverse.

How about isolated_cpus_applied?

-- 
Best regards,
Ridong


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH/for-next v4 3/4] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
  2026-02-10  1:39       ` Chen Ridong
@ 2026-02-10 14:39         ` Waiman Long
  0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2026-02-10 14:39 UTC (permalink / raw)
  To: Chen Ridong, Waiman Long, Tejun Heo, Johannes Weiner,
	Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker,
	Thomas Gleixner, Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest


On 2/9/26 8:39 PM, Chen Ridong wrote:
>
> On 2026/2/10 4:20, Waiman Long wrote:
>> On 2/9/26 2:23 AM, Chen Ridong wrote:
>>> On 2026/2/7 4:37, Waiman Long wrote:
>>>> +static cpumask_var_t    isolated_hk_cpus;    /* T */
>>> Can we get this from isolation.c instead?
>>>
>>> The name probably shouldn't include 'hk', since it refers to the inverse
>>> (housekeeping CPUs) of isolated CPUs, right?
>> The housekeeping_update() will create an inverse of the pass-in isolated
>> cpumasks. As for the name, I add hk to indicate this cpumask is for passing to
>> housekeeping_update() to update housekeeping cpumask. It is not directly related
>> to the cpumasks in sched/isolation.c. Please let me know if you have  a
>> suggestion for the name.
>>
> I understand the intent. However, when reading both cpuset.c and
> sched/isolation.c, it can be confusing whether isolated_hk_cpus is an inverse
> mask, since in sched/isolation.c “hk” consistently refers to the inverse.
>
> How about isolated_cpus_applied?

Applied to what? I did add a comment to describe isolated_hk_cpus as a
copy of isolated_cpus to be passed to housekeeping_update(). "hk" in the
name refers to its role of being passed to that function. I can't use
"isolated_cpus" for now as it may get modified by CPU hotplug
concurrently. In the future, if CPU hotplug no longer modifies
isolated_cpus, I will remove isolated_hk_cpus and pass isolated_cpus
directly to housekeeping_update(). I don't think we need to spend extra
time bikeshedding what the right name should be.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH/for-next v4 4/4] cgroup/cpuset: Eliminate some duplicated rebuild_sched_domains() calls
  2026-02-06 20:37 [PATCH/for-next v4 0/4] cgroup/cpuset: Fix partition related locking issues Waiman Long
                   ` (2 preceding siblings ...)
  2026-02-06 20:37 ` [PATCH/for-next v4 3/4] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock Waiman Long
@ 2026-02-06 20:37 ` Waiman Long
  2026-02-09  7:53   ` Chen Ridong
  3 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2026-02-06 20:37 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long

Now that any changes to the HK_TYPE_DOMAIN housekeeping cpumasks are
going to be deferred to either task_work or workqueue, where the
rebuild_sched_domains() call will be issued, the current
rebuild_sched_domains_locked() call near the end of the cpuset critical
section can be removed in such cases.

Currently, a boolean force_sd_rebuild flag is used to decide if
rebuild_sched_domains_locked() needs to be invoked. To allow
deferral like that, we change it to a tri-state sd_rebuild enumeration
type.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index d26c77a726b2..e224df321e34 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -173,7 +173,11 @@ static bool		isolcpus_twork_queued;	/* T */
  * Note that update_relax_domain_level() in cpuset-v1.c can still call
  * rebuild_sched_domains_locked() directly without using this flag.
  */
-static bool force_sd_rebuild;			/* RWCS */
+static enum {
+	SD_NO_REBUILD = 0,
+	SD_REBUILD,
+	SD_DEFER_REBUILD,
+} sd_rebuild;					/* RWCS */
 
 /*
  * Partition root states:
@@ -990,7 +994,7 @@ void rebuild_sched_domains_locked(void)
 
 	lockdep_assert_cpus_held();
 	lockdep_assert_cpuset_lock_held();
-	force_sd_rebuild = false;
+	sd_rebuild = SD_NO_REBUILD;
 
 	/* Generate domain masks and attrs */
 	ndoms = generate_sched_domains(&doms, &attr);
@@ -1377,6 +1381,9 @@ static void update_isolation_cpumasks(void)
 	else
 		isolated_cpus_updating = false;
 
+	/* Defer rebuild_sched_domains() to task_work or wq */
+	sd_rebuild = SD_DEFER_REBUILD;
+
 	/*
 	 * This function can be reached either directly from regular cpuset
 	 * control file write or via CPU hotplug. In the latter case, it is
@@ -3011,7 +3018,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	update_partition_sd_lb(cs, old_prs);
 
 	notify_partition_change(cs, old_prs);
-	if (force_sd_rebuild)
+	if (sd_rebuild == SD_REBUILD)
 		rebuild_sched_domains_locked();
 	free_tmpmasks(&tmpmask);
 	return 0;
@@ -3288,7 +3295,7 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	}
 
 	free_cpuset(trialcs);
-	if (force_sd_rebuild)
+	if (sd_rebuild == SD_REBUILD)
 		rebuild_sched_domains_locked();
 out_unlock:
 	cpuset_full_unlock();
@@ -3771,7 +3778,8 @@ hotplug_update_tasks(struct cpuset *cs,
 
 void cpuset_force_rebuild(void)
 {
-	force_sd_rebuild = true;
+	if (!sd_rebuild)
+		sd_rebuild = SD_REBUILD;
 }
 
 /**
@@ -3981,7 +3989,7 @@ static void cpuset_handle_hotplug(void)
 	}
 
 	/* rebuild sched domains if necessary */
-	if (force_sd_rebuild)
+	if (sd_rebuild == SD_REBUILD)
 		rebuild_sched_domains_cpuslocked();
 
 	free_tmpmasks(ptmp);
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH/for-next v4 4/4] cgroup/cpuset: Eliminate some duplicated rebuild_sched_domains() calls
  2026-02-06 20:37 ` [PATCH/for-next v4 4/4] cgroup/cpuset: Eliminate some duplicated rebuild_sched_domains() calls Waiman Long
@ 2026-02-09  7:53   ` Chen Ridong
  2026-02-09 20:47     ` Waiman Long
  0 siblings, 1 reply; 22+ messages in thread
From: Chen Ridong @ 2026-02-09  7:53 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest



On 2026/2/7 4:37, Waiman Long wrote:
> Now that we are going to defer any changes to the HK_TYPE_DOMAIN
> housekeeping cpumasks to either task_work or workqueue, where the
> rebuild_sched_domains() call will be issued, the current
> rebuild_sched_domains_locked() call near the end of the cpuset critical
> section can be removed in such cases.
> 
> Currently, a boolean force_sd_rebuild flag is used to decide if
> rebuild_sched_domains_locked() call needs to be invoked. To allow
> deferral like that, we change it to a tri-state sd_rebuild enumeration
> type.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/cgroup/cpuset.c | 20 ++++++++++++++------
>  1 file changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index d26c77a726b2..e224df321e34 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -173,7 +173,11 @@ static bool		isolcpus_twork_queued;	/* T */
>   * Note that update_relax_domain_level() in cpuset-v1.c can still call
>   * rebuild_sched_domains_locked() directly without using this flag.
>   */
> -static bool force_sd_rebuild;			/* RWCS */
> +static enum {
> +	SD_NO_REBUILD = 0,
> +	SD_REBUILD,
> +	SD_DEFER_REBUILD,
> +} sd_rebuild;					/* RWCS */
>  
>  /*
>   * Partition root states:
> @@ -990,7 +994,7 @@ void rebuild_sched_domains_locked(void)
>  
>  	lockdep_assert_cpus_held();
>  	lockdep_assert_cpuset_lock_held();
> -	force_sd_rebuild = false;
> +	sd_rebuild = SD_NO_REBUILD;
>  
>  	/* Generate domain masks and attrs */
>  	ndoms = generate_sched_domains(&doms, &attr);
> @@ -1377,6 +1381,9 @@ static void update_isolation_cpumasks(void)
>  	else
>  		isolated_cpus_updating = false;
>  

If isolated_hk_cpus is defined, I believe isolated_cpus_updating becomes redundant.

> +	/* Defer rebuild_sched_domains() to task_work or wq */
> +	sd_rebuild = SD_DEFER_REBUILD;
> +

There is a potential issue: we defer all domain rebuilds here, including those
triggered by hotplug events which may change the isolation state.

The problem is that functions like cpuset_cpu_active(), which rely on the
scheduler domains being up to date, will also be delayed. Is that okay?

>  	/*
>  	 * This function can be reached either directly from regular cpuset
>  	 * control file write or via CPU hotplug. In the latter case, it is
> @@ -3011,7 +3018,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
>  	update_partition_sd_lb(cs, old_prs);
>  
>  	notify_partition_change(cs, old_prs);
> -	if (force_sd_rebuild)
> +	if (sd_rebuild == SD_REBUILD)
>  		rebuild_sched_domains_locked();
>  	free_tmpmasks(&tmpmask);
>  	return 0;
> @@ -3288,7 +3295,7 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>  	}
>  
>  	free_cpuset(trialcs);
> -	if (force_sd_rebuild)
> +	if (sd_rebuild == SD_REBUILD)
>  		rebuild_sched_domains_locked();
>  out_unlock:
>  	cpuset_full_unlock();
> @@ -3771,7 +3778,8 @@ hotplug_update_tasks(struct cpuset *cs,
>  
>  void cpuset_force_rebuild(void)
>  {
> -	force_sd_rebuild = true;
> +	if (!sd_rebuild)
> +		sd_rebuild = SD_REBUILD;
>  }
>  
>  /**
> @@ -3981,7 +3989,7 @@ static void cpuset_handle_hotplug(void)
>  	}
>  
>  	/* rebuild sched domains if necessary */
> -	if (force_sd_rebuild)
> +	if (sd_rebuild == SD_REBUILD)
>  		rebuild_sched_domains_cpuslocked();
>  
>  	free_tmpmasks(ptmp);

-- 
Best regards,
Ridong



* Re: [PATCH/for-next v4 4/4] cgroup/cpuset: Eliminate some duplicated rebuild_sched_domains() calls
  2026-02-09  7:53   ` Chen Ridong
@ 2026-02-09 20:47     ` Waiman Long
  0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2026-02-09 20:47 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest

On 2/9/26 2:53 AM, Chen Ridong wrote:
>
> On 2026/2/7 4:37, Waiman Long wrote:
>> Now that we are going to defer any changes to the HK_TYPE_DOMAIN
>> housekeeping cpumasks to either task_work or workqueue, where the
>> rebuild_sched_domains() call will be issued, the current
>> rebuild_sched_domains_locked() call near the end of the cpuset critical
>> section can be removed in such cases.
>>
>> Currently, a boolean force_sd_rebuild flag is used to decide if
>> rebuild_sched_domains_locked() call needs to be invoked. To allow
>> deferral like that, we change it to a tri-state sd_rebuild enumeration
>> type.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   kernel/cgroup/cpuset.c | 20 ++++++++++++++------
>>   1 file changed, 14 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index d26c77a726b2..e224df321e34 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -173,7 +173,11 @@ static bool		isolcpus_twork_queued;	/* T */
>>    * Note that update_relax_domain_level() in cpuset-v1.c can still call
>>    * rebuild_sched_domains_locked() directly without using this flag.
>>    */
>> -static bool force_sd_rebuild;			/* RWCS */
>> +static enum {
>> +	SD_NO_REBUILD = 0,
>> +	SD_REBUILD,
>> +	SD_DEFER_REBUILD,
>> +} sd_rebuild;					/* RWCS */
>>   
>>   /*
>>    * Partition root states:
>> @@ -990,7 +994,7 @@ void rebuild_sched_domains_locked(void)
>>   
>>   	lockdep_assert_cpus_held();
>>   	lockdep_assert_cpuset_lock_held();
>> -	force_sd_rebuild = false;
>> +	sd_rebuild = SD_NO_REBUILD;
>>   
>>   	/* Generate domain masks and attrs */
>>   	ndoms = generate_sched_domains(&doms, &attr);
>> @@ -1377,6 +1381,9 @@ static void update_isolation_cpumasks(void)
>>   	else
>>   		isolated_cpus_updating = false;
>>   
> If isolated_hk_cpus is defined, I believe isolated_cpus_updating becomes redundant.
Note that they have different exclusion rules. Other than that, you are
right that "!cpumask_equal(isolated_hk_cpus, isolated_cpus)" should be
equivalent to isolated_cpus_updating. But because of the different
exclusion rules, there are restrictions on where you can use one or the
other.
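
The equivalence described above can be sketched with a toy user-space
model. It assumes that isolated_hk_cpus holds the set of isolated CPUs last
applied to the housekeeping cpumasks, which this message does not spell
out. The model_*() helpers and the one-bit-per-CPU uint64_t masks are
hypothetical stand-ins for the kernel's cpumask code, and locking is
ignored entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins, one bit per CPU. Names follow the thread; semantics are assumed. */
static uint64_t isolated_cpus;		/* current set of isolated CPUs */
static uint64_t isolated_hk_cpus;	/* assumed: set last applied to housekeeping */
static bool isolated_cpus_updating;	/* a change is pending */

/* Hypothetical helper: change the isolated set and mark the change pending. */
void model_set_isolated(uint64_t mask)
{
	if (mask == isolated_cpus)
		return;
	isolated_cpus = mask;
	isolated_cpus_updating = true;
}

/* Hypothetical helper: apply the pending set, as a deferred update would. */
void model_apply_isolated(void)
{
	isolated_hk_cpus = isolated_cpus;
	isolated_cpus_updating = false;
}

/* Models !cpumask_equal(isolated_hk_cpus, isolated_cpus). */
bool model_masks_differ(void)
{
	return isolated_hk_cpus != isolated_cpus;
}
```

In this model the flag and the mask comparison agree before a change, after
model_set_isolated(), and again after model_apply_isolated(). It says
nothing about which lock must be held to read either one, which is the
restriction pointed out above.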
>
>> +	/* Defer rebuild_sched_domains() to task_work or wq */
>> +	sd_rebuild = SD_DEFER_REBUILD;
>> +
> There is a potential issue: we defer all domain rebuilds here, including those
> triggered by hotplug events which may change the isolation state.
>
> The problem is that functions like cpuset_cpu_active(), which rely on the
> scheduler domains being up to date, will also be delayed. Is that okay?

No, we are not deferring all domain rebuilds. We are just deferring
domain rebuilds that involve changes in the set of isolated CPUs.
Domain rebuilds will still happen if there are no changes in the set of
isolated CPUs. I need to take a further look to investigate whether this
is a problem or not. Anyway, as suggested in my reply to Frederic, I am
considering not changing isolated_cpus due to hotplug events. In that
case, this problem should be gone.
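
The tri-state behaviour discussed here can be illustrated with a small
user-space model. This is only a sketch: the three enum values come from
the patch, while every model_*() function, the nr_rebuilds counter, and the
assumption that the deferred task_work/workqueue handler ends by calling
the rebuild are hypothetical stand-ins, not kernel code.

```c
#include <assert.h>

/* The three states are taken from the patch; the rest is a toy model. */
enum sd_rebuild_state {
	SD_NO_REBUILD = 0,
	SD_REBUILD,
	SD_DEFER_REBUILD,
};

static enum sd_rebuild_state sd_rebuild;
static int nr_rebuilds;		/* hypothetical counter, for illustration */

/* Stand-in for rebuild_sched_domains_locked(): clears the request. */
static void model_rebuild(void)
{
	sd_rebuild = SD_NO_REBUILD;
	nr_rebuilds++;
}

/* Mirrors cpuset_force_rebuild(): never overrides a pending deferral. */
static void model_force_rebuild(void)
{
	if (!sd_rebuild)
		sd_rebuild = SD_REBUILD;
}

/* Mirrors the assignment added to update_isolation_cpumasks(). */
static void model_isolation_changed(void)
{
	sd_rebuild = SD_DEFER_REBUILD;
}

/* Mirrors the check at the end of the cpuset critical section. */
static void model_end_of_critical_section(void)
{
	if (sd_rebuild == SD_REBUILD)
		model_rebuild();
}

/* Assumed shape of the deferred handler: it issues the rebuild. */
static void model_deferred_work(void)
{
	model_rebuild();
}

/* No change to the isolated CPUs: the rebuild happens in place. */
int model_plain_update(void)
{
	nr_rebuilds = 0;
	sd_rebuild = SD_NO_REBUILD;
	model_force_rebuild();
	model_end_of_critical_section();
	return nr_rebuilds;
}

/* Isolated CPUs change: the in-place rebuild is skipped and deferred. */
int model_isolation_update(void)
{
	nr_rebuilds = 0;
	sd_rebuild = SD_NO_REBUILD;
	model_force_rebuild();
	model_isolation_changed();
	model_force_rebuild();		/* stays SD_DEFER_REBUILD */
	model_end_of_critical_section();
	assert(nr_rebuilds == 0);	/* nothing rebuilt yet */
	model_deferred_work();
	return nr_rebuilds;
}
```

In this model, model_plain_update() rebuilds once at the end of the
critical section, while model_isolation_update() skips that rebuild and
performs it once from the deferred handler. The window between the two is
the delay that the hotplug question above is about.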

Cheers,
Longman



end of thread, other threads:[~2026-02-10 18:53 UTC | newest]

Thread overview: 22+ messages
2026-02-06 20:37 [PATCH/for-next v4 0/4] cgroup/cpuset: Fix partition related locking issues Waiman Long
2026-02-06 20:37 ` [PATCH/for-next v4 1/4] cgroup/cpuset: Clarify exclusion rules for cpuset internal variables Waiman Long
2026-02-09  3:41   ` Chen Ridong
2026-02-09 19:58     ` Waiman Long
2026-02-06 20:37 ` [PATCH/for-next v4 2/4] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue Waiman Long
2026-02-06 22:28   ` Frederic Weisbecker
2026-02-08  2:00     ` Waiman Long
2026-02-10 15:46       ` Frederic Weisbecker
2026-02-10 18:53         ` Waiman Long
2026-02-09  6:57   ` Chen Ridong
2026-02-06 20:37 ` [PATCH/for-next v4 3/4] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock Waiman Long
2026-02-09  7:12   ` Chen Ridong
2026-02-09 20:29     ` Waiman Long
2026-02-10  1:29       ` Chen Ridong
2026-02-10 14:01         ` Waiman Long
2026-02-09  7:23   ` Chen Ridong
2026-02-09 20:20     ` Waiman Long
2026-02-10  1:39       ` Chen Ridong
2026-02-10 14:39         ` Waiman Long
2026-02-06 20:37 ` [PATCH/for-next v4 4/4] cgroup/cpuset: Eliminate some duplicated rebuild_sched_domains() calls Waiman Long
2026-02-09  7:53   ` Chen Ridong
2026-02-09 20:47     ` Waiman Long
