* [PATCH v5 1/6] cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del()
2026-02-12 16:46 [PATCH v5 0/6] cgroup/cpuset: Fix partition related locking issues Waiman Long
@ 2026-02-12 16:46 ` Waiman Long
2026-02-13 0:52 ` Chen Ridong
2026-02-12 16:46 ` [PATCH v5 2/6] cgroup/cpuset: Clarify exclusion rules for cpuset internal variables Waiman Long
` (4 subsequent siblings)
5 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2026-02-12 16:46 UTC (permalink / raw)
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
Shuah Khan
Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long
The effective_xcpus of a cpuset can contain offline CPUs. In
partition_xcpus_del(), the xcpus parameter is incorrectly used as
a temporary cpumask to mask out offline CPUs. As xcpus can be the
effective_xcpus of a cpuset, this can result in unexpected changes
in that cpumask. Fix this problem by not making any changes to the
xcpus parameter.
Fixes: 11e5f407b64a ("cgroup/cpuset: Keep track of CPUs in isolated partitions")
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c43efef7df71..a366ef84f982 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1221,8 +1221,8 @@ static void partition_xcpus_del(int old_prs, struct cpuset *parent,
isolated_cpus_update(old_prs, parent->partition_root_state,
xcpus);
- cpumask_and(xcpus, xcpus, cpu_active_mask);
cpumask_or(parent->effective_cpus, parent->effective_cpus, xcpus);
+ cpumask_and(parent->effective_cpus, parent->effective_cpus, cpu_active_mask);
}
/*
--
2.52.0
^ permalink raw reply related [flat|nested] 16+ messages in thread

* Re: [PATCH v5 1/6] cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del()
2026-02-12 16:46 ` [PATCH v5 1/6] cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del() Waiman Long
@ 2026-02-13 0:52 ` Chen Ridong
0 siblings, 0 replies; 16+ messages in thread
From: Chen Ridong @ 2026-02-13 0:52 UTC (permalink / raw)
To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
Shuah Khan
Cc: cgroups, linux-kernel, linux-kselftest
On 2026/2/13 0:46, Waiman Long wrote:
> The effective_xcpus of a cpuset can contain offline CPUs. In
> partition_xcpus_del(), the xcpus parameter is incorrectly used as
> a temporary cpumask to mask out offline CPUs. As xcpus can be the
> effective_xcpus of a cpuset, this can result in unexpected changes
> in that cpumask. Fix this problem by not making any changes to the
> xcpus parameter.
>
> Fixes: 11e5f407b64a ("cgroup/cpuset: Keep track of CPUs in isolated partitions")
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> kernel/cgroup/cpuset.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index c43efef7df71..a366ef84f982 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1221,8 +1221,8 @@ static void partition_xcpus_del(int old_prs, struct cpuset *parent,
> isolated_cpus_update(old_prs, parent->partition_root_state,
> xcpus);
>
> - cpumask_and(xcpus, xcpus, cpu_active_mask);
> cpumask_or(parent->effective_cpus, parent->effective_cpus, xcpus);
> + cpumask_and(parent->effective_cpus, parent->effective_cpus, cpu_active_mask);
> }
>
> /*
The final content of parent->effective_cpus remains unchanged, and the fix
looks correct to me.
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
--
Best regards,
Ridong
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v5 2/6] cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
2026-02-12 16:46 [PATCH v5 0/6] cgroup/cpuset: Fix partition related locking issues Waiman Long
2026-02-12 16:46 ` [PATCH v5 1/6] cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del() Waiman Long
@ 2026-02-12 16:46 ` Waiman Long
2026-02-13 1:35 ` Chen Ridong
2026-02-12 16:46 ` [PATCH v5 3/6] cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed Waiman Long
` (3 subsequent siblings)
5 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2026-02-12 16:46 UTC (permalink / raw)
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
Shuah Khan
Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long
Clarify the locking rules associated with file level internal variables
inside the cpuset code. There is no functional change.
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 105 ++++++++++++++++++++++++-----------------
1 file changed, 61 insertions(+), 44 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a366ef84f982..e55855269432 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -61,6 +61,58 @@ static const char * const perr_strings[] = {
[PERR_REMOTE] = "Have remote partition underneath",
};
+/*
+ * CPUSET Locking Convention
+ * -------------------------
+ *
+ * Below are the three global locks guarding cpuset structures in lock
+ * acquisition order:
+ * - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
+ * - cpuset_mutex
+ * - callback_lock (raw spinlock)
+ *
+ * A task must hold all three locks to modify externally visible or
+ * used fields of cpusets, though some of the internally used cpuset fields
+ * and internal variables can be modified without holding callback_lock. If only
+ * reliable read access to the externally used fields is needed, a task can
+ * hold either cpuset_mutex or callback_lock which are exposed to other
+ * external subsystems.
+ *
+ * If a task holds cpu_hotplug_lock and cpuset_mutex, it blocks others,
+ * ensuring that it is the only task able to also acquire callback_lock and
+ * be able to modify cpusets. It can perform various checks on the cpuset
+ * structure first, knowing nothing will change. It can also allocate memory
+ * without holding callback_lock. While it is performing these checks, various
+ * callback routines can briefly acquire callback_lock to query cpusets. Once
+ * it is ready to make the changes, it takes callback_lock, blocking everyone
+ * else.
+ *
+ * Calls to the kernel memory allocator cannot be made while holding
+ * callback_lock which is a spinlock, as the memory allocator may sleep or
+ * call back into cpuset code and acquire callback_lock.
+ *
+ * Now, the task_struct fields mems_allowed and mempolicy may be changed
+ * by other task, we use alloc_lock in the task_struct fields to protect
+ * them.
+ *
+ * The cpuset_common_seq_show() handlers only hold callback_lock across
+ * small pieces of code, such as when reading out possibly multi-word
+ * cpumasks and nodemasks.
+ */
+
+static DEFINE_MUTEX(cpuset_mutex);
+
+/*
+ * File level internal variables below follow one of the following exclusion
+ * rules.
+ *
+ * RWCS: Read/write-able by holding either cpus_write_lock (and optionally
+ * cpuset_mutex) or both cpus_read_lock and cpuset_mutex.
+ *
+ * CSCB: Readable by holding either cpuset_mutex or callback_lock. Writable
+ * by holding both cpuset_mutex and callback_lock.
+ */
+
/*
* For local partitions, update to subpartitions_cpus & isolated_cpus is done
* in update_parent_effective_cpumask(). For remote partitions, it is done in
@@ -70,19 +122,18 @@ static const char * const perr_strings[] = {
* Exclusive CPUs distributed out to local or remote sub-partitions of
* top_cpuset
*/
-static cpumask_var_t subpartitions_cpus;
+static cpumask_var_t subpartitions_cpus; /* RWCS */
/*
- * Exclusive CPUs in isolated partitions
+ * Exclusive CPUs in isolated partitions (shown in cpuset.cpus.isolated)
*/
-static cpumask_var_t isolated_cpus;
+static cpumask_var_t isolated_cpus; /* CSCB */
/*
- * isolated_cpus updating flag (protected by cpuset_mutex)
- * Set if isolated_cpus is going to be updated in the current
- * cpuset_mutex crtical section.
+ * Set if isolated_cpus is being updated in the current cpuset_mutex
+ * critical section.
*/
-static bool isolated_cpus_updating;
+static bool isolated_cpus_updating; /* RWCS */
/*
* A flag to force sched domain rebuild at the end of an operation.
@@ -98,7 +149,7 @@ static bool isolated_cpus_updating;
* Note that update_relax_domain_level() in cpuset-v1.c can still call
* rebuild_sched_domains_locked() directly without using this flag.
*/
-static bool force_sd_rebuild;
+static bool force_sd_rebuild; /* RWCS */
/*
* Partition root states:
@@ -218,42 +269,6 @@ struct cpuset top_cpuset = {
.partition_root_state = PRS_ROOT,
};
-/*
- * There are two global locks guarding cpuset structures - cpuset_mutex and
- * callback_lock. The cpuset code uses only cpuset_mutex. Other kernel
- * subsystems can use cpuset_lock()/cpuset_unlock() to prevent change to cpuset
- * structures. Note that cpuset_mutex needs to be a mutex as it is used in
- * paths that rely on priority inheritance (e.g. scheduler - on RT) for
- * correctness.
- *
- * A task must hold both locks to modify cpusets. If a task holds
- * cpuset_mutex, it blocks others, ensuring that it is the only task able to
- * also acquire callback_lock and be able to modify cpusets. It can perform
- * various checks on the cpuset structure first, knowing nothing will change.
- * It can also allocate memory while just holding cpuset_mutex. While it is
- * performing these checks, various callback routines can briefly acquire
- * callback_lock to query cpusets. Once it is ready to make the changes, it
- * takes callback_lock, blocking everyone else.
- *
- * Calls to the kernel memory allocator can not be made while holding
- * callback_lock, as that would risk double tripping on callback_lock
- * from one of the callbacks into the cpuset code from within
- * __alloc_pages().
- *
- * If a task is only holding callback_lock, then it has read-only
- * access to cpusets.
- *
- * Now, the task_struct fields mems_allowed and mempolicy may be changed
- * by other task, we use alloc_lock in the task_struct fields to protect
- * them.
- *
- * The cpuset_common_seq_show() handlers only hold callback_lock across
- * small pieces of code, such as when reading out possibly multi-word
- * cpumasks and nodemasks.
- */
-
-static DEFINE_MUTEX(cpuset_mutex);
-
/**
* cpuset_lock - Acquire the global cpuset mutex
*
@@ -1163,6 +1178,8 @@ static void reset_partition_data(struct cpuset *cs)
static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus)
{
WARN_ON_ONCE(old_prs == new_prs);
+ lockdep_assert_held(&callback_lock);
+ lockdep_assert_held(&cpuset_mutex);
if (new_prs == PRS_ISOLATED)
cpumask_or(isolated_cpus, isolated_cpus, xcpus);
else
--
2.52.0
^ permalink raw reply related [flat|nested] 16+ messages in thread

* Re: [PATCH v5 2/6] cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
2026-02-12 16:46 ` [PATCH v5 2/6] cgroup/cpuset: Clarify exclusion rules for cpuset internal variables Waiman Long
@ 2026-02-13 1:35 ` Chen Ridong
0 siblings, 0 replies; 16+ messages in thread
From: Chen Ridong @ 2026-02-13 1:35 UTC (permalink / raw)
To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
Shuah Khan
Cc: cgroups, linux-kernel, linux-kselftest
On 2026/2/13 0:46, Waiman Long wrote:
> Clarify the locking rules associated with file level internal variables
> inside the cpuset code. There is no functional change.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> kernel/cgroup/cpuset.c | 105 ++++++++++++++++++++++++-----------------
> 1 file changed, 61 insertions(+), 44 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index a366ef84f982..e55855269432 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -61,6 +61,58 @@ static const char * const perr_strings[] = {
> [PERR_REMOTE] = "Have remote partition underneath",
> };
>
> +/*
> + * CPUSET Locking Convention
> + * -------------------------
> + *
> + * Below are the three global locks guarding cpuset structures in lock
> + * acquisition order:
> + * - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
> + * - cpuset_mutex
> + * - callback_lock (raw spinlock)
> + *
> + * A task must hold all three locks to modify externally visible or
> + * used fields of cpusets, though some of the internally used cpuset fields
> + * and internal variables can be modified without holding callback_lock. If only
> + * reliable read access to the externally used fields is needed, a task can
> + * hold either cpuset_mutex or callback_lock which are exposed to other
> + * external subsystems.
> + *
> + * If a task holds cpu_hotplug_lock and cpuset_mutex, it blocks others,
> + * ensuring that it is the only task able to also acquire callback_lock and
> + * be able to modify cpusets. It can perform various checks on the cpuset
> + * structure first, knowing nothing will change. It can also allocate memory
> + * without holding callback_lock. While it is performing these checks, various
> + * callback routines can briefly acquire callback_lock to query cpusets. Once
> + * it is ready to make the changes, it takes callback_lock, blocking everyone
> + * else.
> + *
> + * Calls to the kernel memory allocator cannot be made while holding
> + * callback_lock which is a spinlock, as the memory allocator may sleep or
> + * call back into cpuset code and acquire callback_lock.
> + *
> + * Now, the task_struct fields mems_allowed and mempolicy may be changed
> + * by other task, we use alloc_lock in the task_struct fields to protect
> + * them.
> + *
> + * The cpuset_common_seq_show() handlers only hold callback_lock across
> + * small pieces of code, such as when reading out possibly multi-word
> + * cpumasks and nodemasks.
> + */
> +
> +static DEFINE_MUTEX(cpuset_mutex);
> +
> +/*
> + * File level internal variables below follow one of the following exclusion
> + * rules.
> + *
> + * RWCS: Read/write-able by holding either cpus_write_lock (and optionally
> + * cpuset_mutex) or both cpus_read_lock and cpuset_mutex.
> + *
> + * CSCB: Readable by holding either cpuset_mutex or callback_lock. Writable
> + * by holding both cpuset_mutex and callback_lock.
> + */
> +
> /*
> * For local partitions, update to subpartitions_cpus & isolated_cpus is done
> * in update_parent_effective_cpumask(). For remote partitions, it is done in
> @@ -70,19 +122,18 @@ static const char * const perr_strings[] = {
> * Exclusive CPUs distributed out to local or remote sub-partitions of
> * top_cpuset
> */
> -static cpumask_var_t subpartitions_cpus;
> +static cpumask_var_t subpartitions_cpus; /* RWCS */
>
> /*
> - * Exclusive CPUs in isolated partitions
> + * Exclusive CPUs in isolated partitions (shown in cpuset.cpus.isolated)
> */
> -static cpumask_var_t isolated_cpus;
> +static cpumask_var_t isolated_cpus; /* CSCB */
>
> /*
> - * isolated_cpus updating flag (protected by cpuset_mutex)
> - * Set if isolated_cpus is going to be updated in the current
> - * cpuset_mutex crtical section.
> + * Set if isolated_cpus is being updated in the current cpuset_mutex
> + * critical section.
> */
> -static bool isolated_cpus_updating;
> +static bool isolated_cpus_updating; /* RWCS */
>
> /*
> * A flag to force sched domain rebuild at the end of an operation.
> @@ -98,7 +149,7 @@ static bool isolated_cpus_updating;
> * Note that update_relax_domain_level() in cpuset-v1.c can still call
> * rebuild_sched_domains_locked() directly without using this flag.
> */
> -static bool force_sd_rebuild;
> +static bool force_sd_rebuild; /* RWCS */
>
> /*
> * Partition root states:
> @@ -218,42 +269,6 @@ struct cpuset top_cpuset = {
> .partition_root_state = PRS_ROOT,
> };
>
> -/*
> - * There are two global locks guarding cpuset structures - cpuset_mutex and
> - * callback_lock. The cpuset code uses only cpuset_mutex. Other kernel
> - * subsystems can use cpuset_lock()/cpuset_unlock() to prevent change to cpuset
> - * structures. Note that cpuset_mutex needs to be a mutex as it is used in
> - * paths that rely on priority inheritance (e.g. scheduler - on RT) for
> - * correctness.
> - *
> - * A task must hold both locks to modify cpusets. If a task holds
> - * cpuset_mutex, it blocks others, ensuring that it is the only task able to
> - * also acquire callback_lock and be able to modify cpusets. It can perform
> - * various checks on the cpuset structure first, knowing nothing will change.
> - * It can also allocate memory while just holding cpuset_mutex. While it is
> - * performing these checks, various callback routines can briefly acquire
> - * callback_lock to query cpusets. Once it is ready to make the changes, it
> - * takes callback_lock, blocking everyone else.
> - *
> - * Calls to the kernel memory allocator can not be made while holding
> - * callback_lock, as that would risk double tripping on callback_lock
> - * from one of the callbacks into the cpuset code from within
> - * __alloc_pages().
> - *
> - * If a task is only holding callback_lock, then it has read-only
> - * access to cpusets.
> - *
> - * Now, the task_struct fields mems_allowed and mempolicy may be changed
> - * by other task, we use alloc_lock in the task_struct fields to protect
> - * them.
> - *
> - * The cpuset_common_seq_show() handlers only hold callback_lock across
> - * small pieces of code, such as when reading out possibly multi-word
> - * cpumasks and nodemasks.
> - */
> -
> -static DEFINE_MUTEX(cpuset_mutex);
> -
> /**
> * cpuset_lock - Acquire the global cpuset mutex
> *
> @@ -1163,6 +1178,8 @@ static void reset_partition_data(struct cpuset *cs)
> static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus)
> {
> WARN_ON_ONCE(old_prs == new_prs);
> + lockdep_assert_held(&callback_lock);
> + lockdep_assert_held(&cpuset_mutex);
> if (new_prs == PRS_ISOLATED)
> cpumask_or(isolated_cpus, isolated_cpus, xcpus);
> else
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
--
Best regards,
Ridong
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v5 3/6] cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed
2026-02-12 16:46 [PATCH v5 0/6] cgroup/cpuset: Fix partition related locking issues Waiman Long
2026-02-12 16:46 ` [PATCH v5 1/6] cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del() Waiman Long
2026-02-12 16:46 ` [PATCH v5 2/6] cgroup/cpuset: Clarify exclusion rules for cpuset internal variables Waiman Long
@ 2026-02-12 16:46 ` Waiman Long
2026-02-13 2:06 ` Chen Ridong
2026-02-12 16:46 ` [PATCH v5 4/6] cgroup/cpuset: Don't update isolated_cpus from CPU hotplug Waiman Long
` (2 subsequent siblings)
5 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2026-02-12 16:46 UTC (permalink / raw)
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
Shuah Khan
Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long
Since cpuset now updates the HK_TYPE_DOMAIN housekeeping mask whenever
the set of isolated CPUs changes, such a change is more costly than
before. Currently, the isolated_cpus_updating flag can be set even
when there is no real change in isolated_cpus. Add checks to ensure
that isolated_cpus_updating is set only when isolated_cpus really
changes.
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e55855269432..c792380f9b60 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1180,11 +1180,15 @@ static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus
WARN_ON_ONCE(old_prs == new_prs);
lockdep_assert_held(&callback_lock);
lockdep_assert_held(&cpuset_mutex);
- if (new_prs == PRS_ISOLATED)
+ if (new_prs == PRS_ISOLATED) {
+ if (cpumask_subset(xcpus, isolated_cpus))
+ return;
cpumask_or(isolated_cpus, isolated_cpus, xcpus);
- else
+ } else {
+ if (!cpumask_intersects(xcpus, isolated_cpus))
+ return;
cpumask_andnot(isolated_cpus, isolated_cpus, xcpus);
-
+ }
isolated_cpus_updating = true;
}
--
2.52.0
^ permalink raw reply related [flat|nested] 16+ messages in thread

* Re: [PATCH v5 3/6] cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed
2026-02-12 16:46 ` [PATCH v5 3/6] cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed Waiman Long
@ 2026-02-13 2:06 ` Chen Ridong
0 siblings, 0 replies; 16+ messages in thread
From: Chen Ridong @ 2026-02-13 2:06 UTC (permalink / raw)
To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
Shuah Khan
Cc: cgroups, linux-kernel, linux-kselftest
On 2026/2/13 0:46, Waiman Long wrote:
> Since cpuset now updates the HK_TYPE_DOMAIN housekeeping mask whenever
> the set of isolated CPUs changes, such a change is more costly than
> before. Currently, the isolated_cpus_updating flag can be set even
> when there is no real change in isolated_cpus. Add checks to ensure
> that isolated_cpus_updating is set only when isolated_cpus really
> changes.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> kernel/cgroup/cpuset.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index e55855269432..c792380f9b60 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1180,11 +1180,15 @@ static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus
> WARN_ON_ONCE(old_prs == new_prs);
> lockdep_assert_held(&callback_lock);
> lockdep_assert_held(&cpuset_mutex);
> - if (new_prs == PRS_ISOLATED)
> + if (new_prs == PRS_ISOLATED) {
> + if (cpumask_subset(xcpus, isolated_cpus))
> + return;
> cpumask_or(isolated_cpus, isolated_cpus, xcpus);
> - else
> + } else {
> + if (!cpumask_intersects(xcpus, isolated_cpus))
> + return;
> cpumask_andnot(isolated_cpus, isolated_cpus, xcpus);
> -
> + }
> isolated_cpus_updating = true;
> }
>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
--
Best regards,
Ridong
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v5 4/6] cgroup/cpuset: Don't update isolated_cpus from CPU hotplug
2026-02-12 16:46 [PATCH v5 0/6] cgroup/cpuset: Fix partition related locking issues Waiman Long
` (2 preceding siblings ...)
2026-02-12 16:46 ` [PATCH v5 3/6] cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed Waiman Long
@ 2026-02-12 16:46 ` Waiman Long
2026-02-13 3:28 ` Chen Ridong
2026-02-13 6:56 ` Chen Ridong
2026-02-12 16:46 ` [PATCH v5 5/6] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock Waiman Long
2026-02-12 16:46 ` [PATCH v5 6/6] cgroup/cpuset: Eliminate some duplicated rebuild_sched_domains() calls Waiman Long
5 siblings, 2 replies; 16+ messages in thread
From: Waiman Long @ 2026-02-12 16:46 UTC (permalink / raw)
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
Shuah Khan
Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long
As any change to isolated_cpus is going to be propagated to the
HK_TYPE_DOMAIN housekeeping cpumask, it can be problematic if
housekeeping cpumasks are directly being modified from the CPU hotplug
code path. This is especially the case if we are going to enable dynamic
update to the nohz_full housekeeping cpumask (HK_TYPE_KERNEL_NOISE)
in the near future with the help of CPU hotplug.
Avoid these potential problems by changing the cpuset code to not
update isolated_cpus when called from CPU hotplug. A new special
PRS_INVALID_ISOLCPUS state is added to indicate that the current cpuset
is an invalid partition but its effective_xcpus are still in
isolated_cpus.
This special state will be set if an isolated partition becomes invalid
due to the shutdown of the last active CPU in that partition. We also
need to keep the effective_xcpus even if exclusive_cpus isn't set.
When changes are made to "cpuset.cpus", "cpuset.cpus.exclusive" or
"cpuset.cpus.partition" of a PRS_INVALID_ISOLCPUS cpuset, its state
will be reset back to PRS_INVALID_ISOLATED and its effective_xcpus will
be removed from isolated_cpus before proceeding.
As CPU hotplug will no longer update isolated_cpus, some of the test
cases in test_cpuset_prs.sh will have to be updated to match the new
expected results. Some new test cases are also added to confirm that
"cpuset.cpus.isolated" and HK_TYPE_DOMAIN housekeeping cpumask will
both be updated.
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 85 ++++++++++++++++---
.../selftests/cgroup/test_cpuset_prs.sh | 21 +++--
2 files changed, 87 insertions(+), 19 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c792380f9b60..48b7f275085b 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -159,6 +159,8 @@ static bool force_sd_rebuild; /* RWCS */
* 2 - partition root without load balancing (isolated)
* -1 - invalid partition root
* -2 - invalid isolated partition root
+ * -3 - invalid isolated partition root but with effective xcpus still
+ * in isolated_cpus (set from CPU hotplug side)
*
* There are 2 types of partitions - local or remote. Local partitions are
* those whose parents are partition root themselves. Setting of
@@ -187,6 +189,7 @@ static bool force_sd_rebuild; /* RWCS */
#define PRS_ISOLATED 2
#define PRS_INVALID_ROOT -1
#define PRS_INVALID_ISOLATED -2
+#define PRS_INVALID_ISOLCPUS -3 /* Effective xcpus still in isolated_cpus */
/*
* Temporary cpumasks for working with partitions that are passed among
@@ -382,6 +385,30 @@ static inline bool is_in_v2_mode(void)
(cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE);
}
+/*
+ * If the given cpuset has a partition state of PRS_INVALID_ISOLCPUS,
+ * remove its effective_xcpus from isolated_cpus and reset its state to
+ * PRS_INVALID_ISOLATED. Also clear effective_xcpus if exclusive_cpus is
+ * empty.
+ */
+static void fix_invalid_isolcpus(struct cpuset *cs, struct cpuset *trialcs)
+{
+ if (likely(cs->partition_root_state != PRS_INVALID_ISOLCPUS))
+ return;
+ WARN_ON_ONCE(cpumask_empty(cs->effective_xcpus));
+ spin_lock_irq(&callback_lock);
+ cpumask_andnot(isolated_cpus, isolated_cpus, cs->effective_xcpus);
+ if (cpumask_empty(cs->exclusive_cpus))
+ cpumask_clear(cs->effective_xcpus);
+ cs->partition_root_state = PRS_INVALID_ISOLATED;
+ spin_unlock_irq(&callback_lock);
+ isolated_cpus_updating = true;
+ if (trialcs) {
+ trialcs->partition_root_state = PRS_INVALID_ISOLATED;
+ cpumask_copy(trialcs->effective_xcpus, cs->effective_xcpus);
+ }
+}
+
/**
* partition_is_populated - check if partition has tasks
* @cs: partition root to be checked
@@ -1160,7 +1187,8 @@ static void reset_partition_data(struct cpuset *cs)
lockdep_assert_held(&callback_lock);
- if (cpumask_empty(cs->exclusive_cpus)) {
+ if (cpumask_empty(cs->exclusive_cpus) &&
+ (cs->partition_root_state != PRS_INVALID_ISOLCPUS)) {
cpumask_clear(cs->effective_xcpus);
if (is_cpu_exclusive(cs))
clear_bit(CS_CPU_EXCLUSIVE, &cs->flags);
@@ -1189,6 +1217,10 @@ static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus
return;
cpumask_andnot(isolated_cpus, isolated_cpus, xcpus);
}
+ /*
+ * Shouldn't update isolated_cpus from CPU hotplug
+ */
+ WARN_ON_ONCE(current->flags & PF_KTHREAD);
isolated_cpus_updating = true;
}
@@ -1208,7 +1240,6 @@ static void partition_xcpus_add(int new_prs, struct cpuset *parent,
if (!parent)
parent = &top_cpuset;
-
if (parent == &top_cpuset)
cpumask_or(subpartitions_cpus, subpartitions_cpus, xcpus);
@@ -1224,11 +1255,12 @@ static void partition_xcpus_add(int new_prs, struct cpuset *parent,
* @old_prs: old partition_root_state
* @parent: parent cpuset
* @xcpus: exclusive CPUs to be removed
+ * @no_isolcpus: don't update isolated_cpus
*
* Remote partition if parent == NULL
*/
static void partition_xcpus_del(int old_prs, struct cpuset *parent,
- struct cpumask *xcpus)
+ struct cpumask *xcpus, bool no_isolcpus)
{
WARN_ON_ONCE(old_prs < 0);
lockdep_assert_held(&callback_lock);
@@ -1238,7 +1270,7 @@ static void partition_xcpus_del(int old_prs, struct cpuset *parent,
if (parent == &top_cpuset)
cpumask_andnot(subpartitions_cpus, subpartitions_cpus, xcpus);
- if (old_prs != parent->partition_root_state)
+ if ((old_prs != parent->partition_root_state) && !no_isolcpus)
isolated_cpus_update(old_prs, parent->partition_root_state,
xcpus);
@@ -1496,6 +1528,8 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
*/
static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
{
+ int old_prs = cs->partition_root_state;
+
WARN_ON_ONCE(!is_remote_partition(cs));
/*
* When a CPU is offlined, top_cpuset may end up with no available CPUs,
@@ -1508,14 +1542,24 @@ static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
spin_lock_irq(&callback_lock);
cs->remote_partition = false;
- partition_xcpus_del(cs->partition_root_state, NULL, cs->effective_xcpus);
if (cs->prs_err)
cs->partition_root_state = -cs->partition_root_state;
else
cs->partition_root_state = PRS_MEMBER;
+ /*
+ * Don't update isolated_cpus if calling from CPU hotplug kthread
+ */
+ if ((current->flags & PF_KTHREAD) &&
+ (cs->partition_root_state == PRS_INVALID_ISOLATED))
+ cs->partition_root_state = PRS_INVALID_ISOLCPUS;
- /* effective_xcpus may need to be changed */
- compute_excpus(cs, cs->effective_xcpus);
+ partition_xcpus_del(old_prs, NULL, cs->effective_xcpus,
+ cs->partition_root_state == PRS_INVALID_ISOLCPUS);
+ /*
+ * effective_xcpus may need to be changed
+ */
+ if (cs->partition_root_state != PRS_INVALID_ISOLCPUS)
+ compute_excpus(cs, cs->effective_xcpus);
reset_partition_data(cs);
spin_unlock_irq(&callback_lock);
update_isolation_cpumasks();
@@ -1580,7 +1624,7 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
if (adding)
partition_xcpus_add(prs, NULL, tmp->addmask);
if (deleting)
- partition_xcpus_del(prs, NULL, tmp->delmask);
+ partition_xcpus_del(prs, NULL, tmp->delmask, false);
/*
* Need to update effective_xcpus and exclusive_cpus now as
* update_sibling_cpumasks() below may iterate back to the same cs.
@@ -1893,6 +1937,10 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
if (!part_error)
new_prs = -old_prs;
break;
+ case PRS_INVALID_ISOLCPUS:
+ if (!part_error)
+ new_prs = PRS_ISOLATED;
+ break;
}
}
@@ -1923,12 +1971,19 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
if (old_prs != new_prs)
cs->partition_root_state = new_prs;
+ /*
+ * Don't update isolated_cpus if calling from CPU hotplug kthread
+ */
+ if ((current->flags & PF_KTHREAD) &&
+ (cs->partition_root_state == PRS_INVALID_ISOLATED))
+ cs->partition_root_state = PRS_INVALID_ISOLCPUS;
/*
* Adding to parent's effective_cpus means deletion CPUs from cs
* and vice versa.
*/
if (adding)
- partition_xcpus_del(old_prs, parent, tmp->addmask);
+ partition_xcpus_del(old_prs, parent, tmp->addmask,
+ cs->partition_root_state == PRS_INVALID_ISOLCPUS);
if (deleting)
partition_xcpus_add(new_prs, parent, tmp->delmask);
@@ -2317,6 +2372,7 @@ static void partition_cpus_change(struct cpuset *cs, struct cpuset *trialcs,
if (cs_is_member(cs))
return;
+ fix_invalid_isolcpus(cs, trialcs);
prs_err = validate_partition(cs, trialcs);
if (prs_err)
trialcs->prs_err = cs->prs_err = prs_err;
@@ -2818,6 +2874,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
if (alloc_tmpmasks(&tmpmask))
return -ENOMEM;
+ fix_invalid_isolcpus(cs, NULL);
err = update_partition_exclusive_flag(cs, new_prs);
if (err)
goto out;
@@ -3268,6 +3325,7 @@ static int cpuset_partition_show(struct seq_file *seq, void *v)
type = "root";
fallthrough;
case PRS_INVALID_ISOLATED:
+ case PRS_INVALID_ISOLCPUS:
if (!type)
type = "isolated";
err = perr_strings[READ_ONCE(cs->prs_err)];
@@ -3463,9 +3521,9 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
}
/*
- * If a dying cpuset has the 'cpus.partition' enabled, turn it off by
- * changing it back to member to free its exclusive CPUs back to the pool to
- * be used by other online cpusets.
+ * If a dying cpuset has the 'cpus.partition' enabled or is in the
+ * PRS_INVALID_ISOLCPUS state, turn it off by changing it back to member to
+ * free its exclusive CPUs back to the pool to be used by other online cpusets.
*/
static void cpuset_css_killed(struct cgroup_subsys_state *css)
{
@@ -3473,7 +3531,8 @@ static void cpuset_css_killed(struct cgroup_subsys_state *css)
cpuset_full_lock();
/* Reset valid partition back to member */
- if (is_partition_valid(cs))
+ if (is_partition_valid(cs) ||
+ (cs->partition_root_state == PRS_INVALID_ISOLCPUS))
update_prstate(cs, PRS_MEMBER);
cpuset_full_unlock();
}
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index 5dff3ad53867..380506157f70 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -234,6 +234,7 @@ TEST_MATRIX=(
"$SETUP_A123_PARTITIONS . C2-3 . . . 0 A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
# CPU offlining cases:
+ # cpuset.cpus.isolated should no longer be updated.
" C0-1 . . C2-3 S+ C4-5 . O2=0 0 A1:0-1|B1:3"
"C0-3:P1:S+ C2-3:P1 . . O2=0 . . . 0 A1:0-1|A2:3"
"C0-3:P1:S+ C2-3:P1 . . O2=0 O2=1 . . 0 A1:0-1|A2:2-3"
@@ -245,8 +246,9 @@ TEST_MATRIX=(
"C2-3:P1:S+ C3:P2 . . O2=0 O2=1 . . 0 A1:2|A2:3 A1:P1|A2:P2"
"C2-3:P1:S+ C3:P1 . . O2=0 . . . 0 A1:|A2:3 A1:P1|A2:P1"
"C2-3:P1:S+ C3:P1 . . O3=0 . . . 0 A1:2|A2: A1:P1|A2:P1"
- "C2-3:P1:S+ C3:P1 . . T:O2=0 . . . 0 A1:3|A2:3 A1:P1|A2:P-1"
- "C2-3:P1:S+ C3:P1 . . . T:O3=0 . . 0 A1:2|A2:2 A1:P1|A2:P-1"
+ "C2-3:P1:S+ C3:P2 . . T:O2=0 . . . 0 A1:3|A2:3 A1:P1|A2:P-2"
+ "C1-3:P1:S+ C3:P2 . . . T:O3=0 . . 0 A1:1-2|A2:1-2|XA2:3 A1:P1|A2:P-2 3"
+ "C1-3:P1:S+ C3:P2 . . . T:O3=0 O3=1 . 0 A1:1-2|A2:3|XA2:3 A1:P1|A2:P2 3"
"$SETUP_A123_PARTITIONS . O1=0 . . . 0 A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
"$SETUP_A123_PARTITIONS . O2=0 . . . 0 A1:1|A2:|A3:3 A1:P1|A2:P1|A3:P1"
"$SETUP_A123_PARTITIONS . O3=0 . . . 0 A1:1|A2:2|A3: A1:P1|A2:P1|A3:P1"
@@ -299,13 +301,14 @@ TEST_MATRIX=(
A1:P0|A2:P2|A3:P-1 2-4"
# Remote partition offline tests
+ # CPU offline shouldn't change cpuset.cpus.{isolated,exclusive.effective}
" C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3 X2-3:P2:O2=0 . 0 A1:0-1|A2:1|A3:3 A1:P0|A3:P2 2-3"
" C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3 X2-3:P2:O2=0 O2=1 0 A1:0-1|A2:1|A3:2-3 A1:P0|A3:P2 2-3"
- " C0-3:S+ C1-3:S+ C3 . X2-3 X2-3 P2:O3=0 . 0 A1:0-2|A2:1-2|A3: A1:P0|A3:P2 3"
- " C0-3:S+ C1-3:S+ C3 . X2-3 X2-3 T:P2:O3=0 . 0 A1:0-2|A2:1-2|A3:1-2 A1:P0|A3:P-2 3|"
+ " C0-3:S+ C1-3:S+ C3 . X2-3 X2-3 P2:O3=0 . 0 A1:0-2|A2:1-2|A3:|XA3:3 A1:P0|A3:P2 3"
+ " C0-3:S+ C1-3:S+ C3 . X2-3 X2-3 T:P2:O3=0 . 0 A1:0-2|A2:1-2|A3:1-2|XA3:3 A1:P0|A3:P-2 3"
# An invalidated remote partition cannot self-recover from hotplug
- " C0-3:S+ C1-3:S+ C2 . X2-3 X2-3 T:P2:O2=0 O2=1 0 A1:0-3|A2:1-3|A3:2 A1:P0|A3:P-2 ."
+ " C0-3:S+ C1-3:S+ C2 . X2-3 X2-3 T:P2:O2=0 O2=1 0 A1:0-3|A2:1-3|A3:2|XA3:2 A1:P0|A3:P-2 2"
# cpus.exclusive.effective clearing test
" C0-3:S+ C1-3:S+ C2 . X2-3:X . . . 0 A1:0-3|A2:1-3|A3:2|XA1:"
@@ -764,7 +767,7 @@ check_cgroup_states()
# only CPUs in isolated partitions as well as those that are isolated at
# boot time.
#
-# $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
+# $1 - expected isolated cpu list(s) <isolcpus1>{|<isolcpus2>}
# <isolcpus1> - expected sched/domains value
# <isolcpus2> - cpuset.cpus.isolated value = <isolcpus1> if not defined
#
@@ -773,6 +776,7 @@ check_isolcpus()
EXPECTED_ISOLCPUS=$1
ISCPUS=${CGROUP2}/cpuset.cpus.isolated
ISOLCPUS=$(cat $ISCPUS)
+ HKICPUS=$(cat /sys/devices/system/cpu/isolated)
LASTISOLCPU=
SCHED_DOMAINS=/sys/kernel/debug/sched/domains
if [[ $EXPECTED_ISOLCPUS = . ]]
@@ -810,6 +814,11 @@ check_isolcpus()
ISOLCPUS=
EXPECTED_ISOLCPUS=$EXPECTED_SDOMAIN
+ #
+ # The inverse of HK_TYPE_DOMAIN cpumask in $HKICPUS should match $ISOLCPUS
+ #
+ [[ "$ISOLCPUS" != "$HKICPUS" ]] && return 1
+
#
# Use the sched domain in debugfs to check isolated CPUs, if available
#
--
2.52.0
* Re: [PATCH v5 4/6] cgroup/cpuset: Don't update isolated_cpus from CPU hotplug
2026-02-12 16:46 ` [PATCH v5 4/6] cgroup/cpuset: Don't update isolated_cpus from CPU hotplug Waiman Long
@ 2026-02-13 3:28 ` Chen Ridong
2026-02-13 15:28 ` Waiman Long
2026-02-13 6:56 ` Chen Ridong
1 sibling, 1 reply; 16+ messages in thread
From: Chen Ridong @ 2026-02-13 3:28 UTC (permalink / raw)
To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
Shuah Khan
Cc: cgroups, linux-kernel, linux-kselftest
On 2026/2/13 0:46, Waiman Long wrote:
> As any change to isolated_cpus is going to be propagated to the
> HK_TYPE_DOMAIN housekeeping cpumask, it can be problematic if
> housekeeping cpumasks are directly being modified from the CPU hotplug
> code path. This is especially the case if we are going to enable dynamic
> update to the nohz_full housekeeping cpumask (HK_TYPE_KERNEL_NOISE)
> in the near future with the help of CPU hotplug.
>
> Avoid these potential problems by changing the cpuset code to not
> update isolated_cpus when called from CPU hotplug. A new special
> PRS_INVALID_ISOLCPUS is added to indicate the current cpuset is an
> invalid partition but its effective_xcpus are still in isolated_cpus.
> This special state will be set if an isolated partition becomes invalid
> due to the shutdown of the last active CPU in that partition. We also
> need to keep the effective_xcpus even if exclusive_cpus isn't set.
>
> When changes are made to "cpuset.cpus", "cpuset.cpus.exclusive" or
> "cpuset.cpus.partition" of a PRS_INVALID_ISOLCPUS cpuset, its state
> will be reset back to PRS_INVALID_ISOLATED and its effective_xcpus will
> be removed from isolated_cpus before proceeding.
>
> As CPU hotplug will no longer update isolated_cpus, some of the test
> cases in test_cpuset_prs.sh will have to be updated to match the new
> expected results. Some new test cases are also added to confirm that
> "cpuset.cpus.isolated" and HK_TYPE_DOMAIN housekeeping cpumask will
> both be updated.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> kernel/cgroup/cpuset.c | 85 ++++++++++++++++---
> .../selftests/cgroup/test_cpuset_prs.sh | 21 +++--
> 2 files changed, 87 insertions(+), 19 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index c792380f9b60..48b7f275085b 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -159,6 +159,8 @@ static bool force_sd_rebuild; /* RWCS */
> * 2 - partition root without load balancing (isolated)
> * -1 - invalid partition root
> * -2 - invalid isolated partition root
> + * -3 - invalid isolated partition root but with effective xcpus still
> + * in isolated_cpus (set from CPU hotplug side)
> *
> * There are 2 types of partitions - local or remote. Local partitions are
> * those whose parents are partition root themselves. Setting of
> @@ -187,6 +189,7 @@ static bool force_sd_rebuild; /* RWCS */
> #define PRS_ISOLATED 2
> #define PRS_INVALID_ROOT -1
> #define PRS_INVALID_ISOLATED -2
> +#define PRS_INVALID_ISOLCPUS -3 /* Effective xcpus still in isolated_cpus */
>
> /*
> * Temporary cpumasks for working with partitions that are passed among
> @@ -382,6 +385,30 @@ static inline bool is_in_v2_mode(void)
> (cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE);
> }
>
> +/*
> + * If the given cpuset has a partition state of PRS_INVALID_ISOLCPUS,
> + * remove its effective_xcpus from isolated_cpus and reset its state to
> + * PRS_INVALID_ISOLATED. Also clear effective_xcpus if exclusive_cpus is
> + * empty.
> + */
> +static void fix_invalid_isolcpus(struct cpuset *cs, struct cpuset *trialcs)
> +{
> + if (likely(cs->partition_root_state != PRS_INVALID_ISOLCPUS))
> + return;
> + WARN_ON_ONCE(cpumask_empty(cs->effective_xcpus));
> + spin_lock_irq(&callback_lock);
> + cpumask_andnot(isolated_cpus, isolated_cpus, cs->effective_xcpus);
> + if (cpumask_empty(cs->exclusive_cpus))
> + cpumask_clear(cs->effective_xcpus);
> + cs->partition_root_state = PRS_INVALID_ISOLATED;
> + spin_unlock_irq(&callback_lock);
> + isolated_cpus_updating = true;
> + if (trialcs) {
> + trialcs->partition_root_state = PRS_INVALID_ISOLATED;
> + cpumask_copy(trialcs->effective_xcpus, cs->effective_xcpus);
> + }
> +}
When fix_invalid_isolcpus is called from changing cpus/exclusive cpus, should we
copy cs->effective_xcpus to trialcs->effective_xcpus?
I tested with the following steps (using the whole series):
# cd /sys/fs/cgroup/
# mkdir test
# echo 1 > cpuset.cpus
# cd test/
# echo 1 > cpuset.cpus.exclusive
# echo $$ > cgroup.procs
# echo isolated > cpuset.cpus.partition
# cat cpuset.cpus.partition
isolated
# echo 0 > /sys/devices/system/cpu/cpu1/online
# cat cpuset.cpus.partition
isolated invalid
# echo 2 > cpuset.cpus.exclusive
# cat cpuset.cpus.partition
isolated invalid (Parent unable to distribute cpu downstream)
After changing cpuset.cpus.exclusive to 2, the test cpuset should
become valid again, but it remains invalid.
> /**
> * partition_is_populated - check if partition has tasks
> * @cs: partition root to be checked
> @@ -1160,7 +1187,8 @@ static void reset_partition_data(struct cpuset *cs)
>
> lockdep_assert_held(&callback_lock);
>
> - if (cpumask_empty(cs->exclusive_cpus)) {
> + if (cpumask_empty(cs->exclusive_cpus) &&
> + (cs->partition_root_state != PRS_INVALID_ISOLCPUS)) {
> cpumask_clear(cs->effective_xcpus);
> if (is_cpu_exclusive(cs))
> clear_bit(CS_CPU_EXCLUSIVE, &cs->flags);
> @@ -1189,6 +1217,10 @@ static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus
> return;
> cpumask_andnot(isolated_cpus, isolated_cpus, xcpus);
> }
> + /*
> + * Shouldn't update isolated_cpus from CPU hotplug
> + */
> + WARN_ON_ONCE(current->flags & PF_KTHREAD);
> isolated_cpus_updating = true;
> }
>
> @@ -1208,7 +1240,6 @@ static void partition_xcpus_add(int new_prs, struct cpuset *parent,
> if (!parent)
> parent = &top_cpuset;
>
> -
> if (parent == &top_cpuset)
> cpumask_or(subpartitions_cpus, subpartitions_cpus, xcpus);
>
> @@ -1224,11 +1255,12 @@ static void partition_xcpus_add(int new_prs, struct cpuset *parent,
> * @old_prs: old partition_root_state
> * @parent: parent cpuset
> * @xcpus: exclusive CPUs to be removed
> + * @no_isolcpus: don't update isolated_cpus
> *
> * Remote partition if parent == NULL
> */
> static void partition_xcpus_del(int old_prs, struct cpuset *parent,
> - struct cpumask *xcpus)
> + struct cpumask *xcpus, bool no_isolcpus)
> {
> WARN_ON_ONCE(old_prs < 0);
> lockdep_assert_held(&callback_lock);
> @@ -1238,7 +1270,7 @@ static void partition_xcpus_del(int old_prs, struct cpuset *parent,
> if (parent == &top_cpuset)
> cpumask_andnot(subpartitions_cpus, subpartitions_cpus, xcpus);
>
> - if (old_prs != parent->partition_root_state)
> + if ((old_prs != parent->partition_root_state) && !no_isolcpus)
> isolated_cpus_update(old_prs, parent->partition_root_state,
> xcpus);
>
> @@ -1496,6 +1528,8 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
> */
> static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
> {
> + int old_prs = cs->partition_root_state;
> +
> WARN_ON_ONCE(!is_remote_partition(cs));
> /*
> * When a CPU is offlined, top_cpuset may end up with no available CPUs,
> @@ -1508,14 +1542,24 @@ static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
>
> spin_lock_irq(&callback_lock);
> cs->remote_partition = false;
> - partition_xcpus_del(cs->partition_root_state, NULL, cs->effective_xcpus);
> if (cs->prs_err)
> cs->partition_root_state = -cs->partition_root_state;
> else
> cs->partition_root_state = PRS_MEMBER;
> + /*
> + * Don't update isolated_cpus if calling from CPU hotplug kthread
> + */
> + if ((current->flags & PF_KTHREAD) &&
> + (cs->partition_root_state == PRS_INVALID_ISOLATED))
> + cs->partition_root_state = PRS_INVALID_ISOLCPUS;
>
> - /* effective_xcpus may need to be changed */
> - compute_excpus(cs, cs->effective_xcpus);
> + partition_xcpus_del(old_prs, NULL, cs->effective_xcpus,
> + cs->partition_root_state == PRS_INVALID_ISOLCPUS);
> + /*
> + * effective_xcpus may need to be changed
> + */
> + if (cs->partition_root_state != PRS_INVALID_ISOLCPUS)
> + compute_excpus(cs, cs->effective_xcpus);
> reset_partition_data(cs);
> spin_unlock_irq(&callback_lock);
> update_isolation_cpumasks();
> @@ -1580,7 +1624,7 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
> if (adding)
> partition_xcpus_add(prs, NULL, tmp->addmask);
> if (deleting)
> - partition_xcpus_del(prs, NULL, tmp->delmask);
> + partition_xcpus_del(prs, NULL, tmp->delmask, false);
> /*
> * Need to update effective_xcpus and exclusive_cpus now as
> * update_sibling_cpumasks() below may iterate back to the same cs.
> @@ -1893,6 +1937,10 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
> if (!part_error)
> new_prs = -old_prs;
> break;
> + case PRS_INVALID_ISOLCPUS:
> + if (!part_error)
> + new_prs = PRS_ISOLATED;
> + break;
> }
> }
>
> @@ -1923,12 +1971,19 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
> if (old_prs != new_prs)
> cs->partition_root_state = new_prs;
>
> + /*
> + * Don't update isolated_cpus if calling from CPU hotplug kthread
> + */
> + if ((current->flags & PF_KTHREAD) &&
> + (cs->partition_root_state == PRS_INVALID_ISOLATED))
> + cs->partition_root_state = PRS_INVALID_ISOLCPUS;
> /*
> * Adding to parent's effective_cpus means deletion CPUs from cs
> * and vice versa.
> */
> if (adding)
> - partition_xcpus_del(old_prs, parent, tmp->addmask);
> + partition_xcpus_del(old_prs, parent, tmp->addmask,
> + cs->partition_root_state == PRS_INVALID_ISOLCPUS);
> if (deleting)
> partition_xcpus_add(new_prs, parent, tmp->delmask);
>
> @@ -2317,6 +2372,7 @@ static void partition_cpus_change(struct cpuset *cs, struct cpuset *trialcs,
> if (cs_is_member(cs))
> return;
>
> + fix_invalid_isolcpus(cs, trialcs);
> prs_err = validate_partition(cs, trialcs);
> if (prs_err)
> trialcs->prs_err = cs->prs_err = prs_err;
> @@ -2818,6 +2874,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
> if (alloc_tmpmasks(&tmpmask))
> return -ENOMEM;
>
> + fix_invalid_isolcpus(cs, NULL);
> err = update_partition_exclusive_flag(cs, new_prs);
> if (err)
> goto out;
> @@ -3268,6 +3325,7 @@ static int cpuset_partition_show(struct seq_file *seq, void *v)
> type = "root";
> fallthrough;
> case PRS_INVALID_ISOLATED:
> + case PRS_INVALID_ISOLCPUS:
> if (!type)
> type = "isolated";
> err = perr_strings[READ_ONCE(cs->prs_err)];
> @@ -3463,9 +3521,9 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
> }
>
> /*
> - * If a dying cpuset has the 'cpus.partition' enabled, turn it off by
> - * changing it back to member to free its exclusive CPUs back to the pool to
> - * be used by other online cpusets.
> + * If a dying cpuset has the 'cpus.partition' enabled or is in the
> + * PRS_INVALID_ISOLCPUS state, turn it off by changing it back to member to
> + * free its exclusive CPUs back to the pool to be used by other online cpusets.
> */
> static void cpuset_css_killed(struct cgroup_subsys_state *css)
> {
> @@ -3473,7 +3531,8 @@ static void cpuset_css_killed(struct cgroup_subsys_state *css)
>
> cpuset_full_lock();
> /* Reset valid partition back to member */
> - if (is_partition_valid(cs))
> + if (is_partition_valid(cs) ||
> + (cs->partition_root_state == PRS_INVALID_ISOLCPUS))
> update_prstate(cs, PRS_MEMBER);
> cpuset_full_unlock();
> }
> diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> index 5dff3ad53867..380506157f70 100755
> --- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> @@ -234,6 +234,7 @@ TEST_MATRIX=(
> "$SETUP_A123_PARTITIONS . C2-3 . . . 0 A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
>
> # CPU offlining cases:
> + # cpuset.cpus.isolated should no longer be updated.
> " C0-1 . . C2-3 S+ C4-5 . O2=0 0 A1:0-1|B1:3"
> "C0-3:P1:S+ C2-3:P1 . . O2=0 . . . 0 A1:0-1|A2:3"
> "C0-3:P1:S+ C2-3:P1 . . O2=0 O2=1 . . 0 A1:0-1|A2:2-3"
> @@ -245,8 +246,9 @@ TEST_MATRIX=(
> "C2-3:P1:S+ C3:P2 . . O2=0 O2=1 . . 0 A1:2|A2:3 A1:P1|A2:P2"
> "C2-3:P1:S+ C3:P1 . . O2=0 . . . 0 A1:|A2:3 A1:P1|A2:P1"
> "C2-3:P1:S+ C3:P1 . . O3=0 . . . 0 A1:2|A2: A1:P1|A2:P1"
> - "C2-3:P1:S+ C3:P1 . . T:O2=0 . . . 0 A1:3|A2:3 A1:P1|A2:P-1"
> - "C2-3:P1:S+ C3:P1 . . . T:O3=0 . . 0 A1:2|A2:2 A1:P1|A2:P-1"
> + "C2-3:P1:S+ C3:P2 . . T:O2=0 . . . 0 A1:3|A2:3 A1:P1|A2:P-2"
> + "C1-3:P1:S+ C3:P2 . . . T:O3=0 . . 0 A1:1-2|A2:1-2|XA2:3 A1:P1|A2:P-2 3"
> + "C1-3:P1:S+ C3:P2 . . . T:O3=0 O3=1 . 0 A1:1-2|A2:3|XA2:3 A1:P1|A2:P2 3"
> "$SETUP_A123_PARTITIONS . O1=0 . . . 0 A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
> "$SETUP_A123_PARTITIONS . O2=0 . . . 0 A1:1|A2:|A3:3 A1:P1|A2:P1|A3:P1"
> "$SETUP_A123_PARTITIONS . O3=0 . . . 0 A1:1|A2:2|A3: A1:P1|A2:P1|A3:P1"
> @@ -299,13 +301,14 @@ TEST_MATRIX=(
> A1:P0|A2:P2|A3:P-1 2-4"
>
> # Remote partition offline tests
> + # CPU offline shouldn't change cpuset.cpus.{isolated,exclusive.effective}
> " C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3 X2-3:P2:O2=0 . 0 A1:0-1|A2:1|A3:3 A1:P0|A3:P2 2-3"
> " C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3 X2-3:P2:O2=0 O2=1 0 A1:0-1|A2:1|A3:2-3 A1:P0|A3:P2 2-3"
> - " C0-3:S+ C1-3:S+ C3 . X2-3 X2-3 P2:O3=0 . 0 A1:0-2|A2:1-2|A3: A1:P0|A3:P2 3"
> - " C0-3:S+ C1-3:S+ C3 . X2-3 X2-3 T:P2:O3=0 . 0 A1:0-2|A2:1-2|A3:1-2 A1:P0|A3:P-2 3|"
> + " C0-3:S+ C1-3:S+ C3 . X2-3 X2-3 P2:O3=0 . 0 A1:0-2|A2:1-2|A3:|XA3:3 A1:P0|A3:P2 3"
> + " C0-3:S+ C1-3:S+ C3 . X2-3 X2-3 T:P2:O3=0 . 0 A1:0-2|A2:1-2|A3:1-2|XA3:3 A1:P0|A3:P-2 3"
>
> # An invalidated remote partition cannot self-recover from hotplug
> - " C0-3:S+ C1-3:S+ C2 . X2-3 X2-3 T:P2:O2=0 O2=1 0 A1:0-3|A2:1-3|A3:2 A1:P0|A3:P-2 ."
> + " C0-3:S+ C1-3:S+ C2 . X2-3 X2-3 T:P2:O2=0 O2=1 0 A1:0-3|A2:1-3|A3:2|XA3:2 A1:P0|A3:P-2 2"
>
> # cpus.exclusive.effective clearing test
> " C0-3:S+ C1-3:S+ C2 . X2-3:X . . . 0 A1:0-3|A2:1-3|A3:2|XA1:"
> @@ -764,7 +767,7 @@ check_cgroup_states()
> # only CPUs in isolated partitions as well as those that are isolated at
> # boot time.
> #
> -# $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
> +# $1 - expected isolated cpu list(s) <isolcpus1>{|<isolcpus2>}
> # <isolcpus1> - expected sched/domains value
> # <isolcpus2> - cpuset.cpus.isolated value = <isolcpus1> if not defined
> #
> @@ -773,6 +776,7 @@ check_isolcpus()
> EXPECTED_ISOLCPUS=$1
> ISCPUS=${CGROUP2}/cpuset.cpus.isolated
> ISOLCPUS=$(cat $ISCPUS)
> + HKICPUS=$(cat /sys/devices/system/cpu/isolated)
> LASTISOLCPU=
> SCHED_DOMAINS=/sys/kernel/debug/sched/domains
> if [[ $EXPECTED_ISOLCPUS = . ]]
> @@ -810,6 +814,11 @@ check_isolcpus()
> ISOLCPUS=
> EXPECTED_ISOLCPUS=$EXPECTED_SDOMAIN
>
> + #
> + # The inverse of HK_TYPE_DOMAIN cpumask in $HKICPUS should match $ISOLCPUS
> + #
> + [[ "$ISOLCPUS" != "$HKICPUS" ]] && return 1
> +
> #
> # Use the sched domain in debugfs to check isolated CPUs, if available
> #
--
Best regards,
Ridong
* Re: [PATCH v5 4/6] cgroup/cpuset: Don't update isolated_cpus from CPU hotplug
2026-02-13 3:28 ` Chen Ridong
@ 2026-02-13 15:28 ` Waiman Long
0 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2026-02-13 15:28 UTC (permalink / raw)
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
Shuah Khan
Cc: cgroups, linux-kernel, linux-kselftest
On 2/12/26 10:28 PM, Chen Ridong wrote:
>
> On 2026/2/13 0:46, Waiman Long wrote:
>> As any change to isolated_cpus is going to be propagated to the
>> HK_TYPE_DOMAIN housekeeping cpumask, it can be problematic if
>> housekeeping cpumasks are directly being modified from the CPU hotplug
>> code path. This is especially the case if we are going to enable dynamic
>> update to the nohz_full housekeeping cpumask (HK_TYPE_KERNEL_NOISE)
>> in the near future with the help of CPU hotplug.
>>
>> Avoid these potential problems by changing the cpuset code to not
>> update isolated_cpus when called from CPU hotplug. A new special
>> PRS_INVALID_ISOLCPUS is added to indicate the current cpuset is an
>> invalid partition but its effective_xcpus are still in isolated_cpus.
>> This special state will be set if an isolated partition becomes invalid
>> due to the shutdown of the last active CPU in that partition. We also
>> need to keep the effective_xcpus even if exclusive_cpus isn't set.
>>
>> When changes are made to "cpuset.cpus", "cpuset.cpus.exclusive" or
>> "cpuset.cpus.partition" of a PRS_INVALID_ISOLCPUS cpuset, its state
>> will be reset back to PRS_INVALID_ISOLATED and its effective_xcpus will
>> be removed from isolated_cpus before proceeding.
>>
>> As CPU hotplug will no longer update isolated_cpus, some of the test
>> cases in test_cpuset_prs.sh will have to be updated to match the new
>> expected results. Some new test cases are also added to confirm that
>> "cpuset.cpus.isolated" and HK_TYPE_DOMAIN housekeeping cpumask will
>> both be updated.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>> kernel/cgroup/cpuset.c | 85 ++++++++++++++++---
>> .../selftests/cgroup/test_cpuset_prs.sh | 21 +++--
>> 2 files changed, 87 insertions(+), 19 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index c792380f9b60..48b7f275085b 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -159,6 +159,8 @@ static bool force_sd_rebuild; /* RWCS */
>> * 2 - partition root without load balancing (isolated)
>> * -1 - invalid partition root
>> * -2 - invalid isolated partition root
>> + * -3 - invalid isolated partition root but with effective xcpus still
>> + * in isolated_cpus (set from CPU hotplug side)
>> *
>> * There are 2 types of partitions - local or remote. Local partitions are
>> * those whose parents are partition root themselves. Setting of
>> @@ -187,6 +189,7 @@ static bool force_sd_rebuild; /* RWCS */
>> #define PRS_ISOLATED 2
>> #define PRS_INVALID_ROOT -1
>> #define PRS_INVALID_ISOLATED -2
>> +#define PRS_INVALID_ISOLCPUS -3 /* Effective xcpus still in isolated_cpus */
>>
>> /*
>> * Temporary cpumasks for working with partitions that are passed among
>> @@ -382,6 +385,30 @@ static inline bool is_in_v2_mode(void)
>> (cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE);
>> }
>>
>> +/*
>> + * If the given cpuset has a partition state of PRS_INVALID_ISOLCPUS,
>> + * remove its effective_xcpus from isolated_cpus and reset its state to
>> + * PRS_INVALID_ISOLATED. Also clear effective_xcpus if exclusive_cpus is
>> + * empty.
>> + */
>> +static void fix_invalid_isolcpus(struct cpuset *cs, struct cpuset *trialcs)
>> +{
>> + if (likely(cs->partition_root_state != PRS_INVALID_ISOLCPUS))
>> + return;
>> + WARN_ON_ONCE(cpumask_empty(cs->effective_xcpus));
>> + spin_lock_irq(&callback_lock);
>> + cpumask_andnot(isolated_cpus, isolated_cpus, cs->effective_xcpus);
>> + if (cpumask_empty(cs->exclusive_cpus))
>> + cpumask_clear(cs->effective_xcpus);
>> + cs->partition_root_state = PRS_INVALID_ISOLATED;
>> + spin_unlock_irq(&callback_lock);
>> + isolated_cpus_updating = true;
>> + if (trialcs) {
>> + trialcs->partition_root_state = PRS_INVALID_ISOLATED;
>> + cpumask_copy(trialcs->effective_xcpus, cs->effective_xcpus);
>> + }
>> +}
> When fix_invalid_isolcpus is called from changing cpus/exclusive cpus, should we
> copy cs->effective_xcpus to trialcs->effective_xcpus?
>
> I tested with the following steps (using the whole series):
>
> # cd /sys/fs/cgroup/
> # mkdir test
> # echo 1 > cpuset.cpus
> # cd test/
> # echo 1 > cpuset.cpus.exclusive
> # echo $$ > cgroup.procs
> # echo isolated > cpuset.cpus.partition
> # cat cpuset.cpus.partition
> isolated
> # echo 0 > /sys/devices/system/cpu/cpu1/online
> # cat cpuset.cpus.partition
> isolated invalid
> # echo 2 > cpuset.cpus.exclusive
> # cat cpuset.cpus.partition
> isolated invalid (Parent unable to distribute cpu downstream)
>
> After changing cpuset.cpus.exclusive to 2, the test cpuset should
> become valid again, but it remains invalid.
Right, the change to trialcs->effective_xcpus is unnecessary as
compute_trialcs_excpus() will be called before fix_invalid_isolcpus() is
invoked. Will fix that in the next version.
Thanks,
Longman
* Re: [PATCH v5 4/6] cgroup/cpuset: Don't update isolated_cpus from CPU hotplug
2026-02-12 16:46 ` [PATCH v5 4/6] cgroup/cpuset: Don't update isolated_cpus from CPU hotplug Waiman Long
2026-02-13 3:28 ` Chen Ridong
@ 2026-02-13 6:56 ` Chen Ridong
2026-02-21 19:18 ` Waiman Long
1 sibling, 1 reply; 16+ messages in thread
From: Chen Ridong @ 2026-02-13 6:56 UTC (permalink / raw)
To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
Shuah Khan
Cc: cgroups, linux-kernel, linux-kselftest
On 2026/2/13 0:46, Waiman Long wrote:
> As any change to isolated_cpus is going to be propagated to the
> HK_TYPE_DOMAIN housekeeping cpumask, it can be problematic if
> housekeeping cpumasks are directly being modified from the CPU hotplug
> code path. This is especially the case if we are going to enable dynamic
> update to the nohz_full housekeeping cpumask (HK_TYPE_KERNEL_NOISE)
> in the near future with the help of CPU hotplug.
>
> Avoid these potential problems by changing the cpuset code to not
> update isolated_cpus when called from CPU hotplug. A new special
> PRS_INVALID_ISOLCPUS is added to indicate the current cpuset is an
> invalid partition but its effective_xcpus are still in isolated_cpus.
> This special state will be set if an isolated partition becomes invalid
> due to the shutdown of the last active CPU in that partition. We also
> need to keep the effective_xcpus even if exclusive_cpus isn't set.
>
> When changes are made to "cpuset.cpus", "cpuset.cpus.exclusive" or
> "cpuset.cpus.partition" of a PRS_INVALID_ISOLCPUS cpuset, its state
> will be reset back to PRS_INVALID_ISOLATED and its effective_xcpus will
> be removed from isolated_cpus before proceeding.
>
> As CPU hotplug will no longer update isolated_cpus, some of the test
> cases in test_cpuset_prs.sh will have to be updated to match the new
> expected results. Some new test cases are also added to confirm that
> "cpuset.cpus.isolated" and HK_TYPE_DOMAIN housekeeping cpumask will
> both be updated.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> kernel/cgroup/cpuset.c | 85 ++++++++++++++++---
> .../selftests/cgroup/test_cpuset_prs.sh | 21 +++--
> 2 files changed, 87 insertions(+), 19 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index c792380f9b60..48b7f275085b 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -159,6 +159,8 @@ static bool force_sd_rebuild; /* RWCS */
> * 2 - partition root without load balancing (isolated)
> * -1 - invalid partition root
> * -2 - invalid isolated partition root
> + * -3 - invalid isolated partition root but with effective xcpus still
> + * in isolated_cpus (set from CPU hotplug side)
> *
> * There are 2 types of partitions - local or remote. Local partitions are
> * those whose parents are partition root themselves. Setting of
> @@ -187,6 +189,7 @@ static bool force_sd_rebuild; /* RWCS */
> #define PRS_ISOLATED 2
> #define PRS_INVALID_ROOT -1
> #define PRS_INVALID_ISOLATED -2
> +#define PRS_INVALID_ISOLCPUS -3 /* Effective xcpus still in isolated_cpus */
>
How about adding a helper?

static bool hotplug_invalidate_isolate(struct cpuset *cs)
{
	return (current->flags & PF_KTHREAD) &&
	       (cs->partition_root_state == PRS_INVALID_ISOLATED);
}
> /*
> * Temporary cpumasks for working with partitions that are passed among
> @@ -382,6 +385,30 @@ static inline bool is_in_v2_mode(void)
> (cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE);
> }
>
> +/*
> + * If the given cpuset has a partition state of PRS_INVALID_ISOLCPUS,
> + * remove its effective_xcpus from isolated_cpus and reset its state to
> + * PRS_INVALID_ISOLATED. Also clear effective_xcpus if exclusive_cpus is
> + * empty.
> + */
> +static void fix_invalid_isolcpus(struct cpuset *cs, struct cpuset *trialcs)
> +{
> + if (likely(cs->partition_root_state != PRS_INVALID_ISOLCPUS))
> + return;
if (likely(cs->partition_root_state != PRS_INVALID_ISOLATED ||
cpumask_empty(cs->effective_xcpus)))
return;
> + WARN_ON_ONCE(cpumask_empty(cs->effective_xcpus));
> + spin_lock_irq(&callback_lock);
> + cpumask_andnot(isolated_cpus, isolated_cpus, cs->effective_xcpus);
> + if (cpumask_empty(cs->exclusive_cpus))
> + cpumask_clear(cs->effective_xcpus);
> + cs->partition_root_state = PRS_INVALID_ISOLATED;
> + spin_unlock_irq(&callback_lock);
> + isolated_cpus_updating = true;
> + if (trialcs) {
> + trialcs->partition_root_state = PRS_INVALID_ISOLATED;
> + cpumask_copy(trialcs->effective_xcpus, cs->effective_xcpus);
> + }
> +}
> +
> /**
> * partition_is_populated - check if partition has tasks
> * @cs: partition root to be checked
> @@ -1160,7 +1187,8 @@ static void reset_partition_data(struct cpuset *cs)
>
> lockdep_assert_held(&callback_lock);
>
> - if (cpumask_empty(cs->exclusive_cpus)) {
> + if (cpumask_empty(cs->exclusive_cpus) &&
> + (cs->partition_root_state != PRS_INVALID_ISOLCPUS)) {
if (cpumask_empty(cs->exclusive_cpus) &&
!hotplug_invalidate_isolate(cs))
> cpumask_clear(cs->effective_xcpus);
> if (is_cpu_exclusive(cs))
> clear_bit(CS_CPU_EXCLUSIVE, &cs->flags);
> @@ -1189,6 +1217,10 @@ static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus
> return;
> cpumask_andnot(isolated_cpus, isolated_cpus, xcpus);
> }
> + /*
> + * Shouldn't update isolated_cpus from CPU hotplug
> + */
> + WARN_ON_ONCE(current->flags & PF_KTHREAD);
> isolated_cpus_updating = true;
> }
>
> @@ -1208,7 +1240,6 @@ static void partition_xcpus_add(int new_prs, struct cpuset *parent,
> if (!parent)
> parent = &top_cpuset;
>
> -
> if (parent == &top_cpuset)
> cpumask_or(subpartitions_cpus, subpartitions_cpus, xcpus);
>
> @@ -1224,11 +1255,12 @@ static void partition_xcpus_add(int new_prs, struct cpuset *parent,
> * @old_prs: old partition_root_state
> * @parent: parent cpuset
> * @xcpus: exclusive CPUs to be removed
> + * @no_isolcpus: don't update isolated_cpus
> *
> * Remote partition if parent == NULL
> */
> static void partition_xcpus_del(int old_prs, struct cpuset *parent,
> - struct cpumask *xcpus)
> + struct cpumask *xcpus, bool no_isolcpus)
The no_isolcpus parameter can be removed if the helper is used.
> {
> WARN_ON_ONCE(old_prs < 0);
> lockdep_assert_held(&callback_lock);
> @@ -1238,7 +1270,7 @@ static void partition_xcpus_del(int old_prs, struct cpuset *parent,
> if (parent == &top_cpuset)
> cpumask_andnot(subpartitions_cpus, subpartitions_cpus, xcpus);
>
> - if (old_prs != parent->partition_root_state)
> + if ((old_prs != parent->partition_root_state) && !no_isolcpus)
if ((old_prs != parent->partition_root_state) &&
!hotplug_invalidate_isolate(cs))
> isolated_cpus_update(old_prs, parent->partition_root_state,
> xcpus);
>
> @@ -1496,6 +1528,8 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
> */
> static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
> {
> + int old_prs = cs->partition_root_state;
> +
> WARN_ON_ONCE(!is_remote_partition(cs));
> /*
> * When a CPU is offlined, top_cpuset may end up with no available CPUs,
> @@ -1508,14 +1542,24 @@ static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
>
> spin_lock_irq(&callback_lock);
> cs->remote_partition = false;
> - partition_xcpus_del(cs->partition_root_state, NULL, cs->effective_xcpus);
> if (cs->prs_err)
> cs->partition_root_state = -cs->partition_root_state;
> else
> cs->partition_root_state = PRS_MEMBER;
> + /*
> + * Don't update isolated_cpus if calling from CPU hotplug kthread
> + */
> + if ((current->flags & PF_KTHREAD) &&
> + (cs->partition_root_state == PRS_INVALID_ISOLATED))
> + cs->partition_root_state = PRS_INVALID_ISOLCPUS;
>
This PF_KTHREAD special case can be removed with the suggested helper.
> - /* effective_xcpus may need to be changed */
> - compute_excpus(cs, cs->effective_xcpus);
> + partition_xcpus_del(old_prs, NULL, cs->effective_xcpus,
> + cs->partition_root_state == PRS_INVALID_ISOLCPUS);
> + /*
> + * effective_xcpus may need to be changed
> + */
> + if (cs->partition_root_state != PRS_INVALID_ISOLCPUS)
> + compute_excpus(cs, cs->effective_xcpus);
> reset_partition_data(cs);
> spin_unlock_irq(&callback_lock);
> update_isolation_cpumasks();
> @@ -1580,7 +1624,7 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
> if (adding)
> partition_xcpus_add(prs, NULL, tmp->addmask);
> if (deleting)
> - partition_xcpus_del(prs, NULL, tmp->delmask);
> + partition_xcpus_del(prs, NULL, tmp->delmask, false);
> /*
> * Need to update effective_xcpus and exclusive_cpus now as
> * update_sibling_cpumasks() below may iterate back to the same cs.
> @@ -1893,6 +1937,10 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
> if (!part_error)
> new_prs = -old_prs;
> break;
> + case PRS_INVALID_ISOLCPUS:
> + if (!part_error)
> + new_prs = PRS_ISOLATED;
> + break;
This case can be removed as well.
> }
> }
>
> @@ -1923,12 +1971,19 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
> if (old_prs != new_prs)
> cs->partition_root_state = new_prs;
>
> + /*
> + * Don't update isolated_cpus if calling from CPU hotplug kthread
> + */
> + if ((current->flags & PF_KTHREAD) &&
> + (cs->partition_root_state == PRS_INVALID_ISOLATED))
> + cs->partition_root_state = PRS_INVALID_ISOLCPUS;
This block can be removed with the suggested helper.
> /*
> * Adding to parent's effective_cpus means deletion CPUs from cs
> * and vice versa.
> */
> if (adding)
> - partition_xcpus_del(old_prs, parent, tmp->addmask);
> + partition_xcpus_del(old_prs, parent, tmp->addmask,
> + cs->partition_root_state == PRS_INVALID_ISOLCPUS);
The extra no_isolcpus argument can be removed here too.
> if (deleting)
> partition_xcpus_add(new_prs, parent, tmp->delmask);
>
> @@ -2317,6 +2372,7 @@ static void partition_cpus_change(struct cpuset *cs, struct cpuset *trialcs,
> if (cs_is_member(cs))
> return;
>
> + fix_invalid_isolcpus(cs, trialcs);
> prs_err = validate_partition(cs, trialcs);
> if (prs_err)
> trialcs->prs_err = cs->prs_err = prs_err;
> @@ -2818,6 +2874,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
> if (alloc_tmpmasks(&tmpmask))
> return -ENOMEM;
>
> + fix_invalid_isolcpus(cs, NULL);
> err = update_partition_exclusive_flag(cs, new_prs);
> if (err)
> goto out;
> @@ -3268,6 +3325,7 @@ static int cpuset_partition_show(struct seq_file *seq, void *v)
> type = "root";
> fallthrough;
> case PRS_INVALID_ISOLATED:
> + case PRS_INVALID_ISOLCPUS:
> if (!type)
> type = "isolated";
> err = perr_strings[READ_ONCE(cs->prs_err)];
> @@ -3463,9 +3521,9 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
> }
>
> /*
> - * If a dying cpuset has the 'cpus.partition' enabled, turn it off by
> - * changing it back to member to free its exclusive CPUs back to the pool to
> - * be used by other online cpusets.
> + * If a dying cpuset has the 'cpus.partition' enabled or is in the
> + * PRS_INVALID_ISOLCPUS state, turn it off by changing it back to member to
> + * free its exclusive CPUs back to the pool to be used by other online cpusets.
> */
> static void cpuset_css_killed(struct cgroup_subsys_state *css)
> {
> @@ -3473,7 +3531,8 @@ static void cpuset_css_killed(struct cgroup_subsys_state *css)
>
> cpuset_full_lock();
> /* Reset valid partition back to member */
> - if (is_partition_valid(cs))
> + if (is_partition_valid(cs) ||
> + (cs->partition_root_state == PRS_INVALID_ISOLCPUS))
if (is_partition_valid(cs) ||
(cs->partition_root_state == PRS_INVALID_ISOLATED &&
!cpumask_empty(cs->effective_xcpus)))
> update_prstate(cs, PRS_MEMBER);
> cpuset_full_unlock();
> }
Can this be simpler?
> diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> index 5dff3ad53867..380506157f70 100755
> --- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> @@ -234,6 +234,7 @@ TEST_MATRIX=(
> "$SETUP_A123_PARTITIONS . C2-3 . . . 0 A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
>
> # CPU offlining cases:
> + # cpuset.cpus.isolated should no longer be updated.
> " C0-1 . . C2-3 S+ C4-5 . O2=0 0 A1:0-1|B1:3"
> "C0-3:P1:S+ C2-3:P1 . . O2=0 . . . 0 A1:0-1|A2:3"
> "C0-3:P1:S+ C2-3:P1 . . O2=0 O2=1 . . 0 A1:0-1|A2:2-3"
> @@ -245,8 +246,9 @@ TEST_MATRIX=(
> "C2-3:P1:S+ C3:P2 . . O2=0 O2=1 . . 0 A1:2|A2:3 A1:P1|A2:P2"
> "C2-3:P1:S+ C3:P1 . . O2=0 . . . 0 A1:|A2:3 A1:P1|A2:P1"
> "C2-3:P1:S+ C3:P1 . . O3=0 . . . 0 A1:2|A2: A1:P1|A2:P1"
> - "C2-3:P1:S+ C3:P1 . . T:O2=0 . . . 0 A1:3|A2:3 A1:P1|A2:P-1"
> - "C2-3:P1:S+ C3:P1 . . . T:O3=0 . . 0 A1:2|A2:2 A1:P1|A2:P-1"
> + "C2-3:P1:S+ C3:P2 . . T:O2=0 . . . 0 A1:3|A2:3 A1:P1|A2:P-2"
> + "C1-3:P1:S+ C3:P2 . . . T:O3=0 . . 0 A1:1-2|A2:1-2|XA2:3 A1:P1|A2:P-2 3"
> + "C1-3:P1:S+ C3:P2 . . . T:O3=0 O3=1 . 0 A1:1-2|A2:3|XA2:3 A1:P1|A2:P2 3"
> "$SETUP_A123_PARTITIONS . O1=0 . . . 0 A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
> "$SETUP_A123_PARTITIONS . O2=0 . . . 0 A1:1|A2:|A3:3 A1:P1|A2:P1|A3:P1"
> "$SETUP_A123_PARTITIONS . O3=0 . . . 0 A1:1|A2:2|A3: A1:P1|A2:P1|A3:P1"
> @@ -299,13 +301,14 @@ TEST_MATRIX=(
> A1:P0|A2:P2|A3:P-1 2-4"
>
> # Remote partition offline tests
> + # CPU offline shouldn't change cpuset.cpus.{isolated,exclusive.effective}
> " C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3 X2-3:P2:O2=0 . 0 A1:0-1|A2:1|A3:3 A1:P0|A3:P2 2-3"
> " C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3 X2-3:P2:O2=0 O2=1 0 A1:0-1|A2:1|A3:2-3 A1:P0|A3:P2 2-3"
> - " C0-3:S+ C1-3:S+ C3 . X2-3 X2-3 P2:O3=0 . 0 A1:0-2|A2:1-2|A3: A1:P0|A3:P2 3"
> - " C0-3:S+ C1-3:S+ C3 . X2-3 X2-3 T:P2:O3=0 . 0 A1:0-2|A2:1-2|A3:1-2 A1:P0|A3:P-2 3|"
> + " C0-3:S+ C1-3:S+ C3 . X2-3 X2-3 P2:O3=0 . 0 A1:0-2|A2:1-2|A3:|XA3:3 A1:P0|A3:P2 3"
> + " C0-3:S+ C1-3:S+ C3 . X2-3 X2-3 T:P2:O3=0 . 0 A1:0-2|A2:1-2|A3:1-2|XA3:3 A1:P0|A3:P-2 3"
>
> # An invalidated remote partition cannot self-recover from hotplug
> - " C0-3:S+ C1-3:S+ C2 . X2-3 X2-3 T:P2:O2=0 O2=1 0 A1:0-3|A2:1-3|A3:2 A1:P0|A3:P-2 ."
> + " C0-3:S+ C1-3:S+ C2 . X2-3 X2-3 T:P2:O2=0 O2=1 0 A1:0-3|A2:1-3|A3:2|XA3:2 A1:P0|A3:P-2 2"
>
> # cpus.exclusive.effective clearing test
> " C0-3:S+ C1-3:S+ C2 . X2-3:X . . . 0 A1:0-3|A2:1-3|A3:2|XA1:"
> @@ -764,7 +767,7 @@ check_cgroup_states()
> # only CPUs in isolated partitions as well as those that are isolated at
> # boot time.
> #
> -# $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
> +# $1 - expected isolated cpu list(s) <isolcpus1>{|<isolcpus2>}
> # <isolcpus1> - expected sched/domains value
> # <isolcpus2> - cpuset.cpus.isolated value = <isolcpus1> if not defined
> #
> @@ -773,6 +776,7 @@ check_isolcpus()
> EXPECTED_ISOLCPUS=$1
> ISCPUS=${CGROUP2}/cpuset.cpus.isolated
> ISOLCPUS=$(cat $ISCPUS)
> + HKICPUS=$(cat /sys/devices/system/cpu/isolated)
> LASTISOLCPU=
> SCHED_DOMAINS=/sys/kernel/debug/sched/domains
> if [[ $EXPECTED_ISOLCPUS = . ]]
> @@ -810,6 +814,11 @@ check_isolcpus()
> ISOLCPUS=
> EXPECTED_ISOLCPUS=$EXPECTED_SDOMAIN
>
> + #
> + # The inverse of HK_TYPE_DOMAIN cpumask in $HKICPUS should match $ISOLCPUS
> + #
> + [[ "$ISOLCPUS" != "$HKICPUS" ]] && return 1
> +
> #
> # Use the sched domain in debugfs to check isolated CPUs, if available
> #
--
Best regards,
Ridong
* Re: [PATCH v5 4/6] cgroup/cpuset: Don't update isolated_cpus from CPU hotplug
2026-02-13 6:56 ` Chen Ridong
@ 2026-02-21 19:18 ` Waiman Long
0 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2026-02-21 19:18 UTC (permalink / raw)
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
Shuah Khan
Cc: cgroups, linux-kernel, linux-kselftest
On 2/13/26 1:56 AM, Chen Ridong wrote:
>
> On 2026/2/13 0:46, Waiman Long wrote:
>> As any change to isolated_cpus is going to be propagated to the
>> HK_TYPE_DOMAIN housekeeping cpumask, it can be problematic if
>> housekeeping cpumasks are directly being modified from the CPU hotplug
>> code path. This is especially the case if we are going to enable dynamic
>> update to the nohz_full housekeeping cpumask (HK_TYPE_KERNEL_NOISE)
>> in the near future with the help of CPU hotplug.
>>
>> Avoid these potential problems by changing the cpuset code to not
>> update isolated_cpus when called from CPU hotplug. A new special
>> PRS_INVALID_ISOLCPUS state is added to indicate the current cpuset is an
>> invalid partition but its effective_xcpus are still in isolated_cpus.
>> This special state will be set if an isolated partition becomes invalid
>> due to the shutdown of the last active CPU in that partition. We also
>> need to keep the effective_xcpus even if exclusive_cpus isn't set.
>>
>> When changes are made to "cpuset.cpus", "cpuset.cpus.exclusive" or
>> "cpuset.cpus.partition" of a PRS_INVALID_ISOLCPUS cpuset, its state
>> will be reset back to PRS_INVALID_ISOLATED and its effective_xcpus will
>> be removed from isolated_cpus before proceeding.
>>
>> As CPU hotplug will no longer update isolated_cpus, some of the test
>> cases in test_cpuset_prs.sh will have to be updated to match the new
>> expected results. Some new test cases are also added to confirm that
>> "cpuset.cpus.isolated" and HK_TYPE_DOMAIN housekeeping cpumask will
>> both be updated.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>> kernel/cgroup/cpuset.c | 85 ++++++++++++++++---
>> .../selftests/cgroup/test_cpuset_prs.sh | 21 +++--
>> 2 files changed, 87 insertions(+), 19 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index c792380f9b60..48b7f275085b 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -159,6 +159,8 @@ static bool force_sd_rebuild; /* RWCS */
>> * 2 - partition root without load balancing (isolated)
>> * -1 - invalid partition root
>> * -2 - invalid isolated partition root
>> + * -3 - invalid isolated partition root but with effective xcpus still
>> + * in isolated_cpus (set from CPU hotplug side)
>> *
>> * There are 2 types of partitions - local or remote. Local partitions are
>> * those whose parents are partition root themselves. Setting of
>> @@ -187,6 +189,7 @@ static bool force_sd_rebuild; /* RWCS */
>> #define PRS_ISOLATED 2
>> #define PRS_INVALID_ROOT -1
>> #define PRS_INVALID_ISOLATED -2
>> +#define PRS_INVALID_ISOLCPUS -3 /* Effective xcpus still in isolated_cpus */
>>
> How about adding a helper?
>
> bool hotplug_invalidate_isolate(struct cpuset *cs)
> {
> 	return (current->flags & PF_KTHREAD) &&
> 	       (cs->partition_root_state == PRS_INVALID_ISOLATED);
> }
>
I decided to revert back to the v4 way of deferring the
housekeeping_update() call to a workqueue, but to do it in a simpler
and more straightforward way.
Thanks,
Longman
* [PATCH v5 5/6] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
2026-02-12 16:46 [PATCH v5 0/6] cgroup/cpuset: Fix partition related locking issues Waiman Long
` (3 preceding siblings ...)
2026-02-12 16:46 ` [PATCH v5 4/6] cgroup/cpuset: Don't update isolated_cpus from CPU hotplug Waiman Long
@ 2026-02-12 16:46 ` Waiman Long
2026-02-13 7:47 ` Chen Ridong
2026-02-12 16:46 ` [PATCH v5 6/6] cgroup/cpuset: Eliminate some duplicated rebuild_sched_domains() calls Waiman Long
5 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2026-02-12 16:46 UTC (permalink / raw)
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
Shuah Khan
Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long
The current cpuset partition code is able to dynamically update
the sched domains of a running system and the corresponding
HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
"isolcpus=domain,..." boot command line feature at run time.
The housekeeping cpumask update requires flushing a number of different
workqueues which may not be safe with cpus_read_lock() held as the
workqueue flushing code may acquire cpus_read_lock() or acquire locks
which have a locking dependency with cpus_read_lock() down the chain. Below
is an example of such a circular locking problem.
======================================================
WARNING: possible circular locking dependency detected
6.18.0-test+ #2 Tainted: G S
------------------------------------------------------
test_cpuset_prs/10971 is trying to acquire lock:
ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180
but task is already holding lock:
ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (cpuset_mutex){+.+.}-{4:4}:
-> #3 (cpu_hotplug_lock){++++}-{0:0}:
-> #2 (rtnl_mutex){+.+.}-{4:4}:
-> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
-> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
Chain exists of:
(wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex
5 locks held by test_cpuset_prs/10971:
#0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
#1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
#2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
#3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
#4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
Call Trace:
<TASK>
:
touch_wq_lockdep_map+0x93/0x180
__flush_workqueue+0x111/0x10b0
housekeeping_update+0x12d/0x2d0
update_parent_effective_cpumask+0x595/0x2440
update_prstate+0x89d/0xce0
cpuset_partition_write+0xc5/0x130
cgroup_file_write+0x1a5/0x680
kernfs_fop_write_iter+0x3df/0x5f0
vfs_write+0x525/0xfd0
ksys_write+0xf9/0x1d0
do_syscall_64+0x95/0x520
entry_SYSCALL_64_after_hwframe+0x76/0x7e
To avoid such a circular locking dependency problem, we have to
call housekeeping_update() without holding the cpus_read_lock() and
cpuset_mutex. The current set of wq's flushed by housekeeping_update()
may not have work functions that call cpus_read_lock() directly,
but we are likely to extend the list of wq's that are flushed in the
future. Moreover, the current set of work functions may hold locks that
may have cpu_hotplug_lock down the dependency chain.
One way to do that is to defer the housekeeping_update() call until after
the current cpuset critical section has finished and cpus_read_lock is no
longer held. For cpuset control file writes, this can be done by deferring
it via a task_work callback that runs right before returning to userspace.
To enable mutual exclusion between the housekeeping_update() call and
other cpuset control file write actions, a new top level cpuset_top_mutex
is introduced. This new mutex will be acquired first to allow sharing
variables used by both code paths. However, cpuset updates from CPU
hotplug can still happen in parallel with the housekeeping_update()
call, though that should be rare in a production environment.
As cpus_read_lock() is now no longer held when
tmigr_isolated_exclude_cpumask() is called, it needs to acquire it
directly.
The lockdep_is_cpuset_held() is also updated to return true if either
cpuset_top_mutex or cpuset_mutex is held.
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 99 ++++++++++++++++++++++++++++++++---
kernel/sched/isolation.c | 4 +-
kernel/time/timer_migration.c | 4 +-
3 files changed, 93 insertions(+), 14 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 48b7f275085b..c6a97956a991 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -65,14 +65,28 @@ static const char * const perr_strings[] = {
* CPUSET Locking Convention
* -------------------------
*
- * Below are the three global locks guarding cpuset structures in lock
+ * Below are the four global/local locks guarding cpuset structures in lock
* acquisition order:
+ * - cpuset_top_mutex
* - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
* - cpuset_mutex
* - callback_lock (raw spinlock)
*
- * A task must hold all the three locks to modify externally visible or
- * used fields of cpusets, though some of the internally used cpuset fields
+ * As cpuset will now indirectly flush a number of different workqueues in
+ * housekeeping_update() to update housekeeping cpumasks when the set of
+ * isolated CPUs is going to be changed, it may be vulnerable to deadlock
+ * if we hold cpus_read_lock while calling into housekeeping_update().
+ *
+ * The first cpuset_top_mutex will be held except when calling into
+ * cpuset_handle_hotplug() from the CPU hotplug code where cpus_write_lock
+ * and cpuset_mutex will be held instead. The main purpose of this mutex
+ * is to prevent regular cpuset control file write actions from interfering
+ * with the call to housekeeping_update(), though CPU hotplug operation can
+ * still happen in parallel. This mutex also provides protection for some
+ * internal variables.
+ *
+ * A task must hold all the remaining three locks to modify externally visible
+ * or used fields of cpusets, though some of the internally used cpuset fields
* and internal variables can be modified without holding callback_lock. If only
* reliable read access of the externally used fields are needed, a task can
* hold either cpuset_mutex or callback_lock which are exposed to other
@@ -100,6 +114,7 @@ static const char * const perr_strings[] = {
* cpumasks and nodemasks.
*/
+static DEFINE_MUTEX(cpuset_top_mutex);
static DEFINE_MUTEX(cpuset_mutex);
/*
@@ -111,6 +126,8 @@ static DEFINE_MUTEX(cpuset_mutex);
*
* CSCB: Readable by holding either cpuset_mutex or callback_lock. Writable
* by holding both cpuset_mutex and callback_lock.
+ *
+ * T: Read/write-able by holding the cpuset_top_mutex.
*/
/*
@@ -135,6 +152,18 @@ static cpumask_var_t isolated_cpus; /* CSCB */
*/
static bool isolated_cpus_updating; /* RWCS */
+/*
+ * Copy of isolated_cpus to be passed to housekeeping_update()
+ */
+static cpumask_var_t isolated_hk_cpus; /* T */
+
+/*
+ * Flag to prevent queuing more than one task_work to the same cpuset_top_mutex
+ * critical section.
+ */
+static bool isolcpus_twork_queued; /* T */
+
+
/*
* A flag to force sched domain rebuild at the end of an operation.
* It can be set in
@@ -301,20 +330,24 @@ void lockdep_assert_cpuset_lock_held(void)
*/
void cpuset_full_lock(void)
{
+ mutex_lock(&cpuset_top_mutex);
cpus_read_lock();
mutex_lock(&cpuset_mutex);
}
void cpuset_full_unlock(void)
{
+ isolcpus_twork_queued = false;
mutex_unlock(&cpuset_mutex);
cpus_read_unlock();
+ mutex_unlock(&cpuset_top_mutex);
}
#ifdef CONFIG_LOCKDEP
bool lockdep_is_cpuset_held(void)
{
- return lockdep_is_held(&cpuset_mutex);
+ return lockdep_is_held(&cpuset_mutex) ||
+ lockdep_is_held(&cpuset_top_mutex);
}
#endif
@@ -1338,6 +1371,28 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
return false;
}
+/*
+ * housekeeping_update() will only be called if isolated_cpus differs
+ * from isolated_hk_cpus. To be safe, rebuild_sched_domains() will always
+ * be called just in case there are still pending sched domains changes.
+ */
+static void isolcpus_tworkfn(struct callback_head *cb)
+{
+ bool update_hk = true;
+
+ guard(mutex)(&cpuset_top_mutex);
+ scoped_guard(spinlock_irq, &callback_lock) {
+ if (cpumask_equal(isolated_hk_cpus, isolated_cpus))
+ update_hk = false;
+ else
+ cpumask_copy(isolated_hk_cpus, isolated_cpus);
+ }
+ if (update_hk)
+ WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus) < 0);
+ rebuild_sched_domains();
+ kfree(cb);
+}
+
/*
* update_isolation_cpumasks - Update external isolation related CPU masks
*
@@ -1346,15 +1401,42 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
*/
static void update_isolation_cpumasks(void)
{
- int ret;
+ struct callback_head *twork_cb;
if (!isolated_cpus_updating)
return;
+ else
+ isolated_cpus_updating = false;
+
+ /*
+ * CPU hotplug shouldn't set isolated_cpus_updating.
+ *
+ * To have better flexibility and prevent the possibility of deadlock,
+ * we defer the housekeeping_update() call to after the current cpuset
+ * critical section has finished. This is done via the synchronous
+ * task_work which will be executed right before returning to userspace.
+ *
+ * update_isolation_cpumasks() may be called more than once in the
+ * same cpuset_mutex critical section.
+ */
+ lockdep_assert_held(&cpuset_top_mutex);
+ if (isolcpus_twork_queued)
+ return;
- ret = housekeeping_update(isolated_cpus);
- WARN_ON_ONCE(ret < 0);
+ twork_cb = kzalloc(sizeof(struct callback_head), GFP_KERNEL);
+ if (!twork_cb)
+ return;
- isolated_cpus_updating = false;
+ /*
+ * isolcpus_tworkfn() will be invoked before returning to userspace
+ */
+ init_task_work(twork_cb, isolcpus_tworkfn);
+ if (task_work_add(current, twork_cb, TWA_RESUME)) {
+ kfree(twork_cb);
+ WARN_ON_ONCE(1); /* Current task shouldn't be exiting */
+ } else {
+ isolcpus_twork_queued = true;
+ }
}
/**
@@ -3689,6 +3771,7 @@ int __init cpuset_init(void)
BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL));
BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL));
BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL));
+ BUG_ON(!zalloc_cpumask_var(&isolated_hk_cpus, GFP_KERNEL));
cpumask_setall(top_cpuset.cpus_allowed);
nodes_setall(top_cpuset.mems_allowed);
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 3b725d39c06e..ef152d401fe2 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -123,8 +123,6 @@ int housekeeping_update(struct cpumask *isol_mask)
struct cpumask *trial, *old = NULL;
int err;
- lockdep_assert_cpus_held();
-
trial = kmalloc(cpumask_size(), GFP_KERNEL);
if (!trial)
return -ENOMEM;
@@ -136,7 +134,7 @@ int housekeeping_update(struct cpumask *isol_mask)
}
if (!housekeeping.flags)
- static_branch_enable_cpuslocked(&housekeeping_overridden);
+ static_branch_enable(&housekeeping_overridden);
if (housekeeping.flags & HK_FLAG_DOMAIN)
old = housekeeping_cpumask_dereference(HK_TYPE_DOMAIN);
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 6da9cd562b20..83428aa03aef 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1559,8 +1559,6 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
cpumask_var_t cpumask __free(free_cpumask_var) = CPUMASK_VAR_NULL;
int cpu;
- lockdep_assert_cpus_held();
-
if (!works)
return -ENOMEM;
if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
@@ -1570,6 +1568,7 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
* First set previously isolated CPUs as available (unisolate).
* This cpumask contains only CPUs that switched to available now.
*/
+ guard(cpus_read_lock)();
cpumask_andnot(cpumask, cpu_online_mask, exclude_cpumask);
cpumask_andnot(cpumask, cpumask, tmigr_available_cpumask);
@@ -1626,7 +1625,6 @@ static int __init tmigr_init_isolation(void)
cpumask_andnot(cpumask, cpu_possible_mask, housekeeping_cpumask(HK_TYPE_DOMAIN));
/* Protect against RCU torture hotplug testing */
- guard(cpus_read_lock)();
return tmigr_isolated_exclude_cpumask(cpumask);
}
late_initcall(tmigr_init_isolation);
--
2.52.0
* Re: [PATCH v5 5/6] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
2026-02-12 16:46 ` [PATCH v5 5/6] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock Waiman Long
@ 2026-02-13 7:47 ` Chen Ridong
2026-02-21 19:20 ` Waiman Long
0 siblings, 1 reply; 16+ messages in thread
From: Chen Ridong @ 2026-02-13 7:47 UTC (permalink / raw)
To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
Shuah Khan
Cc: cgroups, linux-kernel, linux-kselftest
Hi Longman:
On 2026/2/13 0:46, Waiman Long wrote:
> The current cpuset partition code is able to dynamically update
> the sched domains of a running system and the corresponding
> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
> "isolcpus=domain,..." boot command line feature at run time.
>
> The housekeeping cpumask update requires flushing a number of different
> workqueues which may not be safe with cpus_read_lock() held as the
> workqueue flushing code may acquire cpus_read_lock() or acquire locks
> which have a locking dependency with cpus_read_lock() down the chain. Below
> is an example of such a circular locking problem.
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 6.18.0-test+ #2 Tainted: G S
> ------------------------------------------------------
> test_cpuset_prs/10971 is trying to acquire lock:
> ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180
>
> but task is already holding lock:
> ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
> -> #4 (cpuset_mutex){+.+.}-{4:4}:
> -> #3 (cpu_hotplug_lock){++++}-{0:0}:
> -> #2 (rtnl_mutex){+.+.}-{4:4}:
> -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
> -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
>
> Chain exists of:
> (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex
>
> 5 locks held by test_cpuset_prs/10971:
> #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
> #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
> #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
> #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
> #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
>
> Call Trace:
> <TASK>
> :
> touch_wq_lockdep_map+0x93/0x180
> __flush_workqueue+0x111/0x10b0
> housekeeping_update+0x12d/0x2d0
> update_parent_effective_cpumask+0x595/0x2440
> update_prstate+0x89d/0xce0
> cpuset_partition_write+0xc5/0x130
> cgroup_file_write+0x1a5/0x680
> kernfs_fop_write_iter+0x3df/0x5f0
> vfs_write+0x525/0xfd0
> ksys_write+0xf9/0x1d0
> do_syscall_64+0x95/0x520
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> To avoid such a circular locking dependency problem, we have to
> call housekeeping_update() without holding the cpus_read_lock() and
> cpuset_mutex. The current set of wq's flushed by housekeeping_update()
> may not have work functions that call cpus_read_lock() directly,
> but we are likely to extend the list of wq's that are flushed in the
> future. Moreover, the current set of work functions may hold locks that
> may have cpu_hotplug_lock down the dependency chain.
>
> One way to do that is to defer the housekeeping_update() call until after
> the current cpuset critical section has finished and cpus_read_lock is no
> longer held. For cpuset control file writes, this can be done by deferring
> it via a task_work callback that runs right before returning to userspace.
>
> To enable mutual exclusion between the housekeeping_update() call and
> other cpuset control file write actions, a new top level cpuset_top_mutex
> is introduced. This new mutex will be acquired first to allow sharing
> variables used by both code paths. However, cpuset updates from CPU
> hotplug can still happen in parallel with the housekeeping_update()
> call, though that should be rare in a production environment.
>
> As cpus_read_lock() is now no longer held when
> tmigr_isolated_exclude_cpumask() is called, it needs to acquire it
> directly.
>
> The lockdep_is_cpuset_held() is also updated to return true if either
> cpuset_top_mutex or cpuset_mutex is held.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> kernel/cgroup/cpuset.c | 99 ++++++++++++++++++++++++++++++++---
> kernel/sched/isolation.c | 4 +-
> kernel/time/timer_migration.c | 4 +-
> 3 files changed, 93 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 48b7f275085b..c6a97956a991 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -65,14 +65,28 @@ static const char * const perr_strings[] = {
> * CPUSET Locking Convention
> * -------------------------
> *
> - * Below are the three global locks guarding cpuset structures in lock
> + * Below are the four global/local locks guarding cpuset structures in lock
> * acquisition order:
> + * - cpuset_top_mutex
> * - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
> * - cpuset_mutex
> * - callback_lock (raw spinlock)
> *
> - * A task must hold all the three locks to modify externally visible or
> - * used fields of cpusets, though some of the internally used cpuset fields
> + * As cpuset will now indirectly flush a number of different workqueues in
> + * housekeeping_update() to update housekeeping cpumasks when the set of
> + * isolated CPUs is going to be changed, it may be vulnerable to deadlock
> + * if we hold cpus_read_lock while calling into housekeeping_update().
> + *
> + * The first cpuset_top_mutex will be held except when calling into
> + * cpuset_handle_hotplug() from the CPU hotplug code where cpus_write_lock
> + * and cpuset_mutex will be held instead. The main purpose of this mutex
> + * is to prevent regular cpuset control file write actions from interfering
> + * with the call to housekeeping_update(), though CPU hotplug operation can
> + * still happen in parallel. This mutex also provides protection for some
> + * internal variables.
> + *
> + * A task must hold all the remaining three locks to modify externally visible
> + * or used fields of cpusets, though some of the internally used cpuset fields
> * and internal variables can be modified without holding callback_lock. If only
> * reliable read access of the externally used fields are needed, a task can
> * hold either cpuset_mutex or callback_lock which are exposed to other
> @@ -100,6 +114,7 @@ static const char * const perr_strings[] = {
> * cpumasks and nodemasks.
> */
>
> +static DEFINE_MUTEX(cpuset_top_mutex);
> static DEFINE_MUTEX(cpuset_mutex);
>
> /*
> @@ -111,6 +126,8 @@ static DEFINE_MUTEX(cpuset_mutex);
> *
> * CSCB: Readable by holding either cpuset_mutex or callback_lock. Writable
> * by holding both cpuset_mutex and callback_lock.
> + *
> + * T: Read/write-able by holding the cpuset_top_mutex.
> */
>
> /*
> @@ -135,6 +152,18 @@ static cpumask_var_t isolated_cpus; /* CSCB */
> */
> static bool isolated_cpus_updating; /* RWCS */
>
> +/*
> + * Copy of isolated_cpus to be passed to housekeeping_update()
> + */
> +static cpumask_var_t isolated_hk_cpus; /* T */
> +
> +/*
> + * Flag to prevent queuing more than one task_work to the same cpuset_top_mutex
> + * critical section.
> + */
> +static bool isolcpus_twork_queued; /* T */
> +
> +
> /*
> * A flag to force sched domain rebuild at the end of an operation.
> * It can be set in
> @@ -301,20 +330,24 @@ void lockdep_assert_cpuset_lock_held(void)
> */
> void cpuset_full_lock(void)
> {
> + mutex_lock(&cpuset_top_mutex);
> cpus_read_lock();
> mutex_lock(&cpuset_mutex);
> }
>
> void cpuset_full_unlock(void)
> {
> + isolcpus_twork_queued = false;
This is odd.
> mutex_unlock(&cpuset_mutex);
> cpus_read_unlock();
> + mutex_unlock(&cpuset_top_mutex);
> }
>
> #ifdef CONFIG_LOCKDEP
> bool lockdep_is_cpuset_held(void)
> {
> - return lockdep_is_held(&cpuset_mutex);
> + return lockdep_is_held(&cpuset_mutex) ||
> + lockdep_is_held(&cpuset_top_mutex);
> }
> #endif
>
> @@ -1338,6 +1371,28 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
> return false;
> }
>
> +/*
> + * housekeeping_update() will only be called if isolated_cpus differs
> + * from isolated_hk_cpus. To be safe, rebuild_sched_domains() will always
> + * be called just in case there are still pending sched domains changes.
> + */
> +static void isolcpus_tworkfn(struct callback_head *cb)
> +{
> + bool update_hk = true;
> +
> + guard(mutex)(&cpuset_top_mutex);
> + scoped_guard(spinlock_irq, &callback_lock) {
> + if (cpumask_equal(isolated_hk_cpus, isolated_cpus))
> + update_hk = false;
> + else
> + cpumask_copy(isolated_hk_cpus, isolated_cpus);
> + }
> + if (update_hk)
> + WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus) < 0);
> + rebuild_sched_domains();
> + kfree(cb);
> +}
> +
> /*
> * update_isolation_cpumasks - Update external isolation related CPU masks
> *
> @@ -1346,15 +1401,42 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
> */
> static void update_isolation_cpumasks(void)
> {
> - int ret;
> + struct callback_head *twork_cb;
>
> if (!isolated_cpus_updating)
> return;
> + else
> + isolated_cpus_updating = false;
> +
> + /*
> + * CPU hotplug shouldn't set isolated_cpus_updating.
> + *
> + * To have better flexibility and prevent the possibility of deadlock,
> + * we defer the housekeeping_update() call to after the current cpuset
> + * critical section has finished. This is done via the synchronous
> + * task_work which will be executed right before returning to userspace.
> + *
> + * update_isolation_cpumasks() may be called more than once in the
> + * same cpuset_mutex critical section.
> + */
> + lockdep_assert_held(&cpuset_top_mutex);
> + if (isolcpus_twork_queued)
> + return;
>
> - ret = housekeeping_update(isolated_cpus);
> - WARN_ON_ONCE(ret < 0);
> + twork_cb = kzalloc(sizeof(struct callback_head), GFP_KERNEL);
> + if (!twork_cb)
> + return;
>
> - isolated_cpus_updating = false;
> + /*
> + * isolcpus_tworkfn() will be invoked before returning to userspace
> + */
> + init_task_work(twork_cb, isolcpus_tworkfn);
> + if (task_work_add(current, twork_cb, TWA_RESUME)) {
> + kfree(twork_cb);
> + WARN_ON_ONCE(1); /* Current task shouldn't be exiting */
> + } else {
> + isolcpus_twork_queued = true;
> + }
> }
>
Actually, I find this function quite complex, with numerous global
variables to maintain.
I'm considering whether we can simplify it. Could we just call
update_isolation_cpumasks() at the end of update_prstate(),
update_cpumask(), and update_exclusive_cpumask()?
i.e.
static void update_isolation_cpumasks(void)
{
struct callback_head twork_cb
if (!isolated_cpus_updating)
return;
task_work_add(...)
isolated_cpus_updating = false;
}
static int update_prstate(struct cpuset *cs, int new_prs)
{
...
free_tmpmasks(&tmpmask);
update_isolation_cpumasks();
return 0;
}
For rebuilding scheduling domains, we could rebuild them during the
set operation only when force_sd_rebuild = true and
!isolated_cpus_updating. Otherwise, the rebuild would be deferred
and performed only once in isolcpus_tworkfn().
> /**
> @@ -3689,6 +3771,7 @@ int __init cpuset_init(void)
> BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL));
> BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL));
> BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL));
> + BUG_ON(!zalloc_cpumask_var(&isolated_hk_cpus, GFP_KERNEL));
>
> cpumask_setall(top_cpuset.cpus_allowed);
> nodes_setall(top_cpuset.mems_allowed);
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 3b725d39c06e..ef152d401fe2 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -123,8 +123,6 @@ int housekeeping_update(struct cpumask *isol_mask)
> struct cpumask *trial, *old = NULL;
> int err;
>
> - lockdep_assert_cpus_held();
> -
> trial = kmalloc(cpumask_size(), GFP_KERNEL);
> if (!trial)
> return -ENOMEM;
> @@ -136,7 +134,7 @@ int housekeeping_update(struct cpumask *isol_mask)
> }
>
> if (!housekeeping.flags)
> - static_branch_enable_cpuslocked(&housekeeping_overridden);
> + static_branch_enable(&housekeeping_overridden);
>
> if (housekeeping.flags & HK_FLAG_DOMAIN)
> old = housekeeping_cpumask_dereference(HK_TYPE_DOMAIN);
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> index 6da9cd562b20..83428aa03aef 100644
> --- a/kernel/time/timer_migration.c
> +++ b/kernel/time/timer_migration.c
> @@ -1559,8 +1559,6 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
> cpumask_var_t cpumask __free(free_cpumask_var) = CPUMASK_VAR_NULL;
> int cpu;
>
> - lockdep_assert_cpus_held();
> -
> if (!works)
> return -ENOMEM;
> if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
> @@ -1570,6 +1568,7 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
> * First set previously isolated CPUs as available (unisolate).
> * This cpumask contains only CPUs that switched to available now.
> */
> + guard(cpus_read_lock)();
> cpumask_andnot(cpumask, cpu_online_mask, exclude_cpumask);
> cpumask_andnot(cpumask, cpumask, tmigr_available_cpumask);
>
> @@ -1626,7 +1625,6 @@ static int __init tmigr_init_isolation(void)
> cpumask_andnot(cpumask, cpu_possible_mask, housekeeping_cpumask(HK_TYPE_DOMAIN));
>
> /* Protect against RCU torture hotplug testing */
> - guard(cpus_read_lock)();
> return tmigr_isolated_exclude_cpumask(cpumask);
> }
> late_initcall(tmigr_init_isolation);
--
Best regards,
Ridong
* Re: [PATCH v5 5/6] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
2026-02-13 7:47 ` Chen Ridong
@ 2026-02-21 19:20 ` Waiman Long
0 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2026-02-21 19:20 UTC (permalink / raw)
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Frederic Weisbecker, Thomas Gleixner, Shuah Khan
Cc: cgroups, linux-kernel, linux-kselftest
On 2/13/26 2:47 AM, Chen Ridong wrote:
> Hi Longman:
>
> On 2026/2/13 0:46, Waiman Long wrote:
>> The current cpuset partition code is able to dynamically update
>> the sched domains of a running system and the corresponding
>> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
>> "isolcpus=domain,..." boot command line feature at run time.
>>
>> The housekeeping cpumask update requires flushing a number of different
>> workqueues which may not be safe with cpus_read_lock() held as the
>> workqueue flushing code may acquire cpus_read_lock() or acquiring locks
>> which have locking dependency with cpus_read_lock() down the chain. Below
>> is an example of such circular locking problem.
>>
>> ======================================================
>> WARNING: possible circular locking dependency detected
>> 6.18.0-test+ #2 Tainted: G S
>> ------------------------------------------------------
>> test_cpuset_prs/10971 is trying to acquire lock:
>> ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180
>>
>> but task is already holding lock:
>> ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
>>
>> which lock already depends on the new lock.
>>
>> the existing dependency chain (in reverse order) is:
>> -> #4 (cpuset_mutex){+.+.}-{4:4}:
>> -> #3 (cpu_hotplug_lock){++++}-{0:0}:
>> -> #2 (rtnl_mutex){+.+.}-{4:4}:
>> -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
>> -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
>>
>> Chain exists of:
>> (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex
>>
>> 5 locks held by test_cpuset_prs/10971:
>> #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
>> #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
>> #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
>> #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
>> #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
>>
>> Call Trace:
>> <TASK>
>> :
>> touch_wq_lockdep_map+0x93/0x180
>> __flush_workqueue+0x111/0x10b0
>> housekeeping_update+0x12d/0x2d0
>> update_parent_effective_cpumask+0x595/0x2440
>> update_prstate+0x89d/0xce0
>> cpuset_partition_write+0xc5/0x130
>> cgroup_file_write+0x1a5/0x680
>> kernfs_fop_write_iter+0x3df/0x5f0
>> vfs_write+0x525/0xfd0
>> ksys_write+0xf9/0x1d0
>> do_syscall_64+0x95/0x520
>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> To avoid such a circular locking dependency problem, we have to
>> call housekeeping_update() without holding the cpus_read_lock() and
>> cpuset_mutex. The current set of wq's flushed by housekeeping_update()
>> may not have work functions that call cpus_read_lock() directly,
>> but we are likely to extend the list of wq's that are flushed in the
>> future. Moreover, the current set of work functions may hold locks that
>> may have cpu_hotplug_lock down the dependency chain.
>>
>> One way to do that is to defer the housekeeping_update() call after
>> the current cpuset critical section has finished without holding
>> cpus_read_lock. For cpuset control file write, this can be done by
>> deferring it using task_work right before returning to userspace.
>>
>> To enable mutual exclusion between the housekeeping_update() call and
>> other cpuset control file write actions, a new top level cpuset_top_mutex
>> is introduced. This new mutex will be acquired first to allow sharing
>> variables used by both code paths. However, cpuset update from CPU
>> hotplug can still happen in parallel with the housekeeping_update()
>> call, though that should be rare in a production environment.
>>
>> As cpus_read_lock() is now no longer held when
>> tmigr_isolated_exclude_cpumask() is called, it needs to acquire it
>> directly.
>>
>> The lockdep_is_cpuset_held() is also updated to return true if either
>> cpuset_top_mutex or cpuset_mutex is held.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>> kernel/cgroup/cpuset.c | 99 ++++++++++++++++++++++++++++++++---
>> kernel/sched/isolation.c | 4 +-
>> kernel/time/timer_migration.c | 4 +-
>> 3 files changed, 93 insertions(+), 14 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 48b7f275085b..c6a97956a991 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -65,14 +65,28 @@ static const char * const perr_strings[] = {
>> * CPUSET Locking Convention
>> * -------------------------
>> *
>> - * Below are the three global locks guarding cpuset structures in lock
>> + * Below are the four global/local locks guarding cpuset structures in lock
>> * acquisition order:
>> + * - cpuset_top_mutex
>> * - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
>> * - cpuset_mutex
>> * - callback_lock (raw spinlock)
>> *
>> - * A task must hold all the three locks to modify externally visible or
>> - * used fields of cpusets, though some of the internally used cpuset fields
>> + * As cpuset will now indirectly flush a number of different workqueues in
>> + * housekeeping_update() to update housekeeping cpumasks when the set of
>> + * isolated CPUs is going to be changed, it may be vulnerable to deadlock
>> + * if we hold cpus_read_lock while calling into housekeeping_update().
>> + *
>> + * The first cpuset_top_mutex will be held except when calling into
>> + * cpuset_handle_hotplug() from the CPU hotplug code where cpus_write_lock
>> + * and cpuset_mutex will be held instead. The main purpose of this mutex
>> + * is to prevent regular cpuset control file write actions from interfering
>> + * with the call to housekeeping_update(), though CPU hotplug operation can
>> + * still happen in parallel. This mutex also provides protection for some
>> + * internal variables.
>> + *
>> + * A task must hold all the remaining three locks to modify externally visible
>> + * or used fields of cpusets, though some of the internally used cpuset fields
>> * and internal variables can be modified without holding callback_lock. If only
>> * reliable read access of the externally used fields are needed, a task can
>> * hold either cpuset_mutex or callback_lock which are exposed to other
>> @@ -100,6 +114,7 @@ static const char * const perr_strings[] = {
>> * cpumasks and nodemasks.
>> */
>>
>> +static DEFINE_MUTEX(cpuset_top_mutex);
>> static DEFINE_MUTEX(cpuset_mutex);
>>
>> /*
>> @@ -111,6 +126,8 @@ static DEFINE_MUTEX(cpuset_mutex);
>> *
>> * CSCB: Readable by holding either cpuset_mutex or callback_lock. Writable
>> * by holding both cpuset_mutex and callback_lock.
>> + *
>> + * T: Read/write-able by holding the cpuset_top_mutex.
>> */
>>
>> /*
>> @@ -135,6 +152,18 @@ static cpumask_var_t isolated_cpus; /* CSCB */
>> */
>> static bool isolated_cpus_updating; /* RWCS */
>>
>> +/*
>> + * Copy of isolated_cpus to be passed to housekeeping_update()
>> + */
>> +static cpumask_var_t isolated_hk_cpus; /* T */
>> +
>> +/*
>> + * Flag to prevent queuing more than one task_work to the same cpuset_top_mutex
>> + * critical section.
>> + */
>> +static bool isolcpus_twork_queued; /* T */
>> +
>> +
>> /*
>> * A flag to force sched domain rebuild at the end of an operation.
>> * It can be set in
>> @@ -301,20 +330,24 @@ void lockdep_assert_cpuset_lock_held(void)
>> */
>> void cpuset_full_lock(void)
>> {
>> + mutex_lock(&cpuset_top_mutex);
>> cpus_read_lock();
>> mutex_lock(&cpuset_mutex);
>> }
>>
>> void cpuset_full_unlock(void)
>> {
>> + isolcpus_twork_queued = false;
> This is odd.
>
>> mutex_unlock(&cpuset_mutex);
>> cpus_read_unlock();
>> + mutex_unlock(&cpuset_top_mutex);
>> }
>>
>> #ifdef CONFIG_LOCKDEP
>> bool lockdep_is_cpuset_held(void)
>> {
>> - return lockdep_is_held(&cpuset_mutex);
>> + return lockdep_is_held(&cpuset_mutex) ||
>> + lockdep_is_held(&cpuset_top_mutex);
>> }
>> #endif
>>
>> @@ -1338,6 +1371,28 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
>> return false;
>> }
>>
>> +/*
>> + * housekeeping_update() will only be called if isolated_cpus differs
>> + * from isolated_hk_cpus. To be safe, rebuild_sched_domains() will always
>> + * be called just in case there are still pending sched domains changes.
>> + */
>> +static void isolcpus_tworkfn(struct callback_head *cb)
>> +{
>> + bool update_hk = true;
>> +
>> + guard(mutex)(&cpuset_top_mutex);
>> + scoped_guard(spinlock_irq, &callback_lock) {
>> + if (cpumask_equal(isolated_hk_cpus, isolated_cpus))
>> + update_hk = false;
>> + else
>> + cpumask_copy(isolated_hk_cpus, isolated_cpus);
>> + }
>> + if (update_hk)
>> + WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus) < 0);
>> + rebuild_sched_domains();
>> + kfree(cb);
>> +}
>> +
>> /*
>> * update_isolation_cpumasks - Update external isolation related CPU masks
>> *
>> @@ -1346,15 +1401,42 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
>> */
>> static void update_isolation_cpumasks(void)
>> {
>> - int ret;
>> + struct callback_head *twork_cb;
>>
>> if (!isolated_cpus_updating)
>> return;
>> + else
>> + isolated_cpus_updating = false;
>> +
>> + /*
>> + * CPU hotplug shouldn't set isolated_cpus_updating.
>> + *
>> + * To have better flexibility and prevent the possibility of deadlock,
>> + * we defer the housekeeping_update() call to after the current cpuset
>> + * critical section has finished. This is done via the synchronous
>> + * task_work which will be executed right before returning to userspace.
>> + *
>> + * update_isolation_cpumasks() may be called more than once in the
>> + * same cpuset_mutex critical section.
>> + */
>> + lockdep_assert_held(&cpuset_top_mutex);
>> + if (isolcpus_twork_queued)
>> + return;
>>
>> - ret = housekeeping_update(isolated_cpus);
>> - WARN_ON_ONCE(ret < 0);
>> + twork_cb = kzalloc(sizeof(struct callback_head), GFP_KERNEL);
>> + if (!twork_cb)
>> + return;
>>
>> - isolated_cpus_updating = false;
>> + /*
>> + * isolcpus_tworkfn() will be invoked before returning to userspace
>> + */
>> + init_task_work(twork_cb, isolcpus_tworkfn);
>> + if (task_work_add(current, twork_cb, TWA_RESUME)) {
>> + kfree(twork_cb);
>> + WARN_ON_ONCE(1); /* Current task shouldn't be exiting */
>> + } else {
>> + isolcpus_twork_queued = true;
>> + }
>> }
>>
> Actually, I find this function quite complex, with numerous global
> variables to maintain.
>
> I'm considering whether we can simplify it. Could we just call
> update_isolation_cpumasks() at the end of update_prstate(),
> update_cpumask(), and update_exclusive_cpumask()?
>
> i.e.
>
> static void update_isolation_cpumasks(void)
> {
> struct callback_head twork_cb
>
> if (!isolated_cpus_updating)
> return;
> task_work_add(...)
> isolated_cpus_updating = false;
> }
>
> static int update_prstate(struct cpuset *cs, int new_prs)
> {
> ...
> free_tmpmasks(&tmpmask);
> update_isolation_cpumasks();
> return 0;
> }
>
> For rebuilding scheduling domains, we could rebuild them during the
> set operation only when force_sd_rebuild = true and
> !isolated_cpus_updating. Otherwise, the rebuild would be deferred
> and performed only once in isolcpus_tworkfn().
Yes, the v5 series makes the code more complex, so I am now reverting
to a simpler approach.
Cheers,
Longman
* [PATCH v5 6/6] cgroup/cpuset: Eliminate some duplicated rebuild_sched_domains() calls
2026-02-12 16:46 [PATCH v5 0/6] cgroup/cpuset: Fix partition related locking issues Waiman Long
` (4 preceding siblings ...)
2026-02-12 16:46 ` [PATCH v5 5/6] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock Waiman Long
@ 2026-02-12 16:46 ` Waiman Long
5 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2026-02-12 16:46 UTC (permalink / raw)
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
Shuah Khan
Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long
Now that any changes to the HK_TYPE_DOMAIN housekeeping cpumasks are
deferred to either a task_work or a workqueue, where the
rebuild_sched_domains() call will be issued, the current
rebuild_sched_domains_locked() call near the end of the cpuset critical
section can be removed in such cases.
Currently, a boolean force_sd_rebuild flag is used to decide if the
rebuild_sched_domains_locked() call needs to be invoked. To allow such
deferral, change it to a tri-state sd_rebuild enumeration type.
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c6a97956a991..426949363ca7 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -178,7 +178,11 @@ static bool isolcpus_twork_queued; /* T */
* Note that update_relax_domain_level() in cpuset-v1.c can still call
* rebuild_sched_domains_locked() directly without using this flag.
*/
-static bool force_sd_rebuild; /* RWCS */
+static enum {
+ SD_NO_REBUILD = 0,
+ SD_REBUILD,
+ SD_DEFER_REBUILD,
+} sd_rebuild; /* RWCS */
/*
* Partition root states:
@@ -1023,7 +1027,7 @@ void rebuild_sched_domains_locked(void)
lockdep_assert_cpus_held();
lockdep_assert_cpuset_lock_held();
- force_sd_rebuild = false;
+ sd_rebuild = SD_NO_REBUILD;
/* Generate domain masks and attrs */
ndoms = generate_sched_domains(&doms, &attr);
@@ -1408,6 +1412,9 @@ static void update_isolation_cpumasks(void)
else
isolated_cpus_updating = false;
+ /* Defer rebuild_sched_domains() to task_work or wq */
+ sd_rebuild = SD_DEFER_REBUILD;
+
/*
* CPU hotplug shouldn't set isolated_cpus_updating.
*
@@ -3053,7 +3060,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
update_partition_sd_lb(cs, old_prs);
notify_partition_change(cs, old_prs);
- if (force_sd_rebuild)
+ if (sd_rebuild == SD_REBUILD)
rebuild_sched_domains_locked();
free_tmpmasks(&tmpmask);
return 0;
@@ -3330,7 +3337,7 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
}
free_cpuset(trialcs);
- if (force_sd_rebuild)
+ if (sd_rebuild == SD_REBUILD)
rebuild_sched_domains_locked();
out_unlock:
cpuset_full_unlock();
@@ -3815,7 +3822,8 @@ hotplug_update_tasks(struct cpuset *cs,
void cpuset_force_rebuild(void)
{
- force_sd_rebuild = true;
+ if (!sd_rebuild)
+ sd_rebuild = SD_REBUILD;
}
/**
@@ -4025,7 +4033,7 @@ static void cpuset_handle_hotplug(void)
}
/* rebuild sched domains if necessary */
- if (force_sd_rebuild)
+ if (sd_rebuild == SD_REBUILD)
rebuild_sched_domains_cpuslocked();
free_tmpmasks(ptmp);
--
2.52.0