* [PATCH 07/33] cpuset: Convert boot_hk_cpus to use HK_TYPE_DOMAIN_BOOT
[not found] <20250829154814.47015-1-frederic@kernel.org>
@ 2025-08-29 15:47 ` Frederic Weisbecker
2025-08-29 15:47 ` [PATCH 12/33] cpuset: Provide lockdep check for cpuset lock held Frederic Weisbecker
` (6 subsequent siblings)
7 siblings, 0 replies; 11+ messages in thread
From: Frederic Weisbecker @ 2025-08-29 15:47 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Johannes Weiner, Marco Crivellari,
Michal Hocko, Michal Koutny, Peter Zijlstra, Tejun Heo,
Thomas Gleixner, Vlastimil Babka, Waiman Long, cgroups
boot_hk_cpus is an ad-hoc copy of HK_TYPE_DOMAIN_BOOT. Remove it and use
the official version.
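For illustration, the conversion boils down to the following pattern
(sketch, not part of the diff below; do_something() is a placeholder):

    /* Before: ad-hoc snapshot taken at boot */
    if (have_boot_isolcpus)
            do_something(boot_hk_cpus);

    /* After: query the housekeeping subsystem directly */
    if (housekeeping_enabled(HK_TYPE_DOMAIN_BOOT))
            do_something(housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT));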
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/cgroup/cpuset.c | 22 +++++++---------------
1 file changed, 7 insertions(+), 15 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 27adb04df675..b00d8e3c30ba 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -80,12 +80,6 @@ static cpumask_var_t subpartitions_cpus;
*/
static cpumask_var_t isolated_cpus;
-/*
- * Housekeeping (HK_TYPE_DOMAIN) CPUs at boot
- */
-static cpumask_var_t boot_hk_cpus;
-static bool have_boot_isolcpus;
-
/* List of remote partition root children */
static struct list_head remote_children;
@@ -1601,15 +1595,16 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
* @new_cpus: cpu mask
* Return: true if there is conflict, false otherwise
*
- * CPUs outside of boot_hk_cpus, if defined, can only be used in an
+ * CPUs outside of HK_TYPE_DOMAIN_BOOT, if defined, can only be used in an
* isolated partition.
*/
static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
{
- if (!have_boot_isolcpus)
+ if (!housekeeping_enabled(HK_TYPE_DOMAIN_BOOT))
return false;
- if ((prstate != PRS_ISOLATED) && !cpumask_subset(new_cpus, boot_hk_cpus))
+ if ((prstate != PRS_ISOLATED) &&
+ !cpumask_subset(new_cpus, housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT)))
return true;
return false;
@@ -3764,12 +3759,9 @@ int __init cpuset_init(void)
BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL));
- have_boot_isolcpus = housekeeping_enabled(HK_TYPE_DOMAIN);
- if (have_boot_isolcpus) {
- BUG_ON(!alloc_cpumask_var(&boot_hk_cpus, GFP_KERNEL));
- cpumask_copy(boot_hk_cpus, housekeeping_cpumask(HK_TYPE_DOMAIN));
- cpumask_andnot(isolated_cpus, cpu_possible_mask, boot_hk_cpus);
- }
+ if (housekeeping_enabled(HK_TYPE_DOMAIN_BOOT))
+ cpumask_andnot(isolated_cpus, cpu_possible_mask,
+ housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT));
return 0;
}
--
2.51.0
* [PATCH 12/33] cpuset: Provide lockdep check for cpuset lock held
[not found] <20250829154814.47015-1-frederic@kernel.org>
2025-08-29 15:47 ` [PATCH 07/33] cpuset: Convert boot_hk_cpus to use HK_TYPE_DOMAIN_BOOT Frederic Weisbecker
@ 2025-08-29 15:47 ` Frederic Weisbecker
2025-08-29 15:47 ` [PATCH 14/33] cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset Frederic Weisbecker
` (5 subsequent siblings)
7 siblings, 0 replies; 11+ messages in thread
From: Frederic Weisbecker @ 2025-08-29 15:47 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Johannes Weiner,
Marco Crivellari, Michal Hocko, Peter Zijlstra, Tejun Heo,
Thomas Gleixner, Vlastimil Babka, Waiman Long, cgroups
cpuset modifies partitions, including isolated ones, while holding the
cpuset mutex.
This means that holding the cpuset mutex is sufficient to synchronize
against housekeeping cpumask changes.
Provide a lockdep check to validate that.
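For illustration, a consumer of this check would follow this sketch,
where hk_domain_mask is a hypothetical RCU pointer that is only updated
with the cpuset mutex held:

    const struct cpumask *mask;

    mask = rcu_dereference_check(hk_domain_mask,
                                 lockdep_is_cpuset_held());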
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/cpuset.h | 2 ++
kernel/cgroup/cpuset.c | 7 +++++++
2 files changed, 9 insertions(+)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 2ddb256187b5..051d36fec578 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -18,6 +18,8 @@
#include <linux/mmu_context.h>
#include <linux/jump_label.h>
+extern bool lockdep_is_cpuset_held(void);
+
#ifdef CONFIG_CPUSETS
/*
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b00d8e3c30ba..2d2fc74bc00c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -254,6 +254,13 @@ void cpuset_unlock(void)
mutex_unlock(&cpuset_mutex);
}
+#ifdef CONFIG_LOCKDEP
+bool lockdep_is_cpuset_held(void)
+{
+ return lockdep_is_held(&cpuset_mutex);
+}
+#endif
+
static DEFINE_SPINLOCK(callback_lock);
void cpuset_callback_lock_irq(void)
--
2.51.0
* [PATCH 14/33] cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
[not found] <20250829154814.47015-1-frederic@kernel.org>
2025-08-29 15:47 ` [PATCH 07/33] cpuset: Convert boot_hk_cpus to use HK_TYPE_DOMAIN_BOOT Frederic Weisbecker
2025-08-29 15:47 ` [PATCH 12/33] cpuset: Provide lockdep check for cpuset lock held Frederic Weisbecker
@ 2025-08-29 15:47 ` Frederic Weisbecker
2025-09-01 0:40 ` Waiman Long
2025-08-29 15:47 ` [PATCH 15/33] sched/isolation: Flush memcg workqueues on cpuset isolated partition change Frederic Weisbecker
` (4 subsequent siblings)
7 siblings, 1 reply; 11+ messages in thread
From: Frederic Weisbecker @ 2025-08-29 15:47 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Ingo Molnar,
Johannes Weiner, Marco Crivellari, Michal Hocko, Peter Zijlstra,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long, cgroups
Until now, HK_TYPE_DOMAIN only included the boot-defined isolated CPUs
passed through the isolcpus= boot option. Users also interested in
knowing the runtime-defined isolated CPUs through cpuset must use
different APIs: cpuset_cpu_is_isolated(), cpu_is_isolated(), etc.
There are many drawbacks to that approach:
1) Most interested subsystems want to know about all isolated CPUs, not
just those defined at boot time.
2) cpuset_cpu_is_isolated() / cpu_is_isolated() are not synchronized with
concurrent cpuset changes.
3) Further cpuset modifications are not propagated to subsystems.
Solve 1) and 2) and centralize all isolated CPUs within the
HK_TYPE_DOMAIN housekeeping cpumask.
Subsystems can rely on RCU to synchronize against concurrent changes.
The propagation mentioned in 3) will be handled in further patches.
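A reader synchronizing against concurrent updates then follows the
usual RCU pattern, as in this sketch (the surrounding logic is
hypothetical):

    bool isolated;

    rcu_read_lock();
    isolated = !cpumask_test_cpu(cpu, housekeeping_cpumask(HK_TYPE_DOMAIN));
    rcu_read_unlock();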
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/sched/isolation.h | 4 +-
kernel/cgroup/cpuset.c | 2 +
kernel/sched/isolation.c | 65 ++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 1 +
4 files changed, 65 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index 9262378760b1..199d0fc4646f 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -36,12 +36,13 @@ extern bool housekeeping_test_cpu(int cpu, enum hk_type type);
static inline bool housekeeping_cpu(int cpu, enum hk_type type)
{
- if (housekeeping_flags & BIT(type))
+ if (READ_ONCE(housekeeping_flags) & BIT(type))
return housekeeping_test_cpu(cpu, type);
else
return true;
}
+extern int housekeeping_update(struct cpumask *mask, enum hk_type type);
extern void __init housekeeping_init(void);
#else
@@ -74,6 +75,7 @@ static inline bool housekeeping_cpu(int cpu, enum hk_type type)
return true;
}
+static inline int housekeeping_update(struct cpumask *mask, enum hk_type type) { return 0; }
static inline void housekeeping_init(void) { }
#endif /* CONFIG_CPU_ISOLATION */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2d2fc74bc00c..4f2bc68332a7 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1351,6 +1351,8 @@ static void update_unbound_workqueue_cpumask(bool isolcpus_updated)
ret = workqueue_unbound_exclude_cpumask(isolated_cpus);
WARN_ON_ONCE(ret < 0);
+ ret = housekeeping_update(isolated_cpus, HK_TYPE_DOMAIN);
+ WARN_ON_ONCE(ret < 0);
}
/**
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 5ddb8dc5ca91..48f3b6b20604 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -23,16 +23,39 @@ EXPORT_SYMBOL_GPL(housekeeping_flags);
bool housekeeping_enabled(enum hk_type type)
{
- return !!(housekeeping_flags & BIT(type));
+ return !!(READ_ONCE(housekeeping_flags) & BIT(type));
}
EXPORT_SYMBOL_GPL(housekeeping_enabled);
+static bool housekeeping_dereference_check(enum hk_type type)
+{
+ if (type == HK_TYPE_DOMAIN) {
+ if (IS_ENABLED(CONFIG_HOTPLUG_CPU) && lockdep_is_cpus_write_held())
+ return true;
+ if (IS_ENABLED(CONFIG_CPUSETS) && lockdep_is_cpuset_held())
+ return true;
+
+ return false;
+ }
+
+ return true;
+}
+
+static inline struct cpumask *__housekeeping_cpumask(enum hk_type type)
+{
+ return rcu_dereference_check(housekeeping_cpumasks[type],
+ housekeeping_dereference_check(type));
+}
+
const struct cpumask *housekeeping_cpumask(enum hk_type type)
{
- if (housekeeping_flags & BIT(type)) {
- return rcu_dereference_check(housekeeping_cpumasks[type], 1);
- }
- return cpu_possible_mask;
+ const struct cpumask *mask = NULL;
+
+ if (READ_ONCE(housekeeping_flags) & BIT(type))
+ mask = __housekeeping_cpumask(type);
+ if (!mask)
+ mask = cpu_possible_mask;
+ return mask;
}
EXPORT_SYMBOL_GPL(housekeeping_cpumask);
@@ -70,12 +93,42 @@ EXPORT_SYMBOL_GPL(housekeeping_affine);
bool housekeeping_test_cpu(int cpu, enum hk_type type)
{
- if (housekeeping_flags & BIT(type))
+ if (READ_ONCE(housekeeping_flags) & BIT(type))
return cpumask_test_cpu(cpu, housekeeping_cpumask(type));
return true;
}
EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
+int housekeeping_update(struct cpumask *mask, enum hk_type type)
+{
+ struct cpumask *trial, *old = NULL;
+
+ if (type != HK_TYPE_DOMAIN)
+ return -ENOTSUPP;
+
+ trial = kmalloc(sizeof(*trial), GFP_KERNEL);
+ if (!trial)
+ return -ENOMEM;
+
+ cpumask_andnot(trial, housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT), mask);
+ if (!cpumask_intersects(trial, cpu_online_mask)) {
+ kfree(trial);
+ return -EINVAL;
+ }
+
+ if (housekeeping_flags & BIT(type))
+ old = __housekeeping_cpumask(type);
+ else
+ WRITE_ONCE(housekeeping_flags, housekeeping_flags | BIT(type));
+ rcu_assign_pointer(housekeeping_cpumasks[type], trial);
+
+ synchronize_rcu();
+
+ kfree(old);
+
+ return 0;
+}
+
void __init housekeeping_init(void)
{
enum hk_type type;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0b1a233dcabf..d3512138027b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -30,6 +30,7 @@
#include <linux/context_tracking.h>
#include <linux/cpufreq.h>
#include <linux/cpumask_api.h>
+#include <linux/cpuset.h>
#include <linux/ctype.h>
#include <linux/file.h>
#include <linux/fs_api.h>
--
2.51.0
* [PATCH 15/33] sched/isolation: Flush memcg workqueues on cpuset isolated partition change
[not found] <20250829154814.47015-1-frederic@kernel.org>
` (2 preceding siblings ...)
2025-08-29 15:47 ` [PATCH 14/33] cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset Frederic Weisbecker
@ 2025-08-29 15:47 ` Frederic Weisbecker
2025-08-29 15:47 ` [PATCH 17/33] cpuset: Propagate cpuset isolation update to workqueue through housekeeping Frederic Weisbecker
` (3 subsequent siblings)
7 siblings, 0 replies; 11+ messages in thread
From: Frederic Weisbecker @ 2025-08-29 15:47 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Andrew Morton, Ingo Molnar, Johannes Weiner,
Marco Crivellari, Michal Hocko, Muchun Song, Peter Zijlstra,
Roman Gushchin, Shakeel Butt, Tejun Heo, Thomas Gleixner,
Vlastimil Babka, Waiman Long, cgroups, linux-mm
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime. In
order to make sure that no asynchronous memcg draining is still pending
or executing on a newly made isolated CPU, the housekeeping subsystem
must flush the memcg workqueues.
However the memcg work items can't be flushed easily since they are
queued on the main per-CPU workqueue pool.
Solve this by creating a memcg-specific workqueue and by providing and
using the appropriate flushing API.
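The underlying pattern is the usual dedicated-workqueue one, sketched
here with hypothetical names:

    static struct workqueue_struct *my_wq;

    my_wq = alloc_workqueue("my_wq", 0, 0);  /* dedicated pool */
    queue_work_on(cpu, my_wq, work);         /* instead of schedule_work_on() */
    flush_workqueue(my_wq);                  /* flushes only my_wq's work items */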
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/memcontrol.h | 4 ++++
kernel/sched/isolation.c | 2 ++
kernel/sched/sched.h | 1 +
mm/memcontrol.c | 12 +++++++++++-
4 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 785173aa0739..8b23ff000473 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1048,6 +1048,8 @@ static inline u64 cgroup_id_from_mm(struct mm_struct *mm)
return id;
}
+void mem_cgroup_flush_workqueue(void);
+
extern int mem_cgroup_init(void);
#else /* CONFIG_MEMCG */
@@ -1453,6 +1455,8 @@ static inline u64 cgroup_id_from_mm(struct mm_struct *mm)
return 0;
}
+static inline void mem_cgroup_flush_workqueue(void) { }
+
static inline int mem_cgroup_init(void) { return 0; }
#endif /* CONFIG_MEMCG */
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 48f3b6b20604..e85f402b103a 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -124,6 +124,8 @@ int housekeeping_update(struct cpumask *mask, enum hk_type type)
synchronize_rcu();
+ mem_cgroup_flush_workqueue();
+
kfree(old);
return 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d3512138027b..1dad1ac7fc61 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -44,6 +44,7 @@
#include <linux/lockdep_api.h>
#include <linux/lockdep.h>
#include <linux/memblock.h>
+#include <linux/memcontrol.h>
#include <linux/minmax.h>
#include <linux/mm.h>
#include <linux/module.h>
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2649d6c09160..1aa2dfa32ccd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -95,6 +95,8 @@ static bool cgroup_memory_nokmem __ro_after_init;
/* BPF memory accounting disabled? */
static bool cgroup_memory_nobpf __ro_after_init;
+static struct workqueue_struct *memcg_wq __ro_after_init;
+
static struct kmem_cache *memcg_cachep;
static struct kmem_cache *memcg_pn_cachep;
@@ -1974,7 +1976,7 @@ static void schedule_drain_work(int cpu, struct work_struct *work)
{
guard(rcu)();
if (!cpu_is_isolated(cpu))
- schedule_work_on(cpu, work);
+ queue_work_on(cpu, memcg_wq, work);
}
/*
@@ -5071,6 +5073,11 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
refill_stock(memcg, nr_pages);
}
+void mem_cgroup_flush_workqueue(void)
+{
+ flush_workqueue(memcg_wq);
+}
+
static int __init cgroup_memory(char *s)
{
char *token;
@@ -5113,6 +5120,9 @@ int __init mem_cgroup_init(void)
cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD, "mm/memctrl:dead", NULL,
memcg_hotplug_cpu_dead);
+ memcg_wq = alloc_workqueue("memcg", 0, 0);
+ WARN_ON(!memcg_wq);
+
for_each_possible_cpu(cpu) {
INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
drain_local_memcg_stock);
--
2.51.0
* [PATCH 17/33] cpuset: Propagate cpuset isolation update to workqueue through housekeeping
[not found] <20250829154814.47015-1-frederic@kernel.org>
` (3 preceding siblings ...)
2025-08-29 15:47 ` [PATCH 15/33] sched/isolation: Flush memcg workqueues on cpuset isolated partition change Frederic Weisbecker
@ 2025-08-29 15:47 ` Frederic Weisbecker
2025-09-01 2:51 ` Waiman Long
2025-08-29 15:47 ` [PATCH 18/33] cpuset: Remove cpuset_cpu_is_isolated() Frederic Weisbecker
` (2 subsequent siblings)
7 siblings, 1 reply; 11+ messages in thread
From: Frederic Weisbecker @ 2025-08-29 15:47 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Ingo Molnar,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Peter Zijlstra, Tejun Heo, Thomas Gleixner, Vlastimil Babka,
Waiman Long, cgroups
Until now, cpuset would propagate isolated partition changes to
workqueues so that unbound workers get properly reaffined.
Since housekeeping now centralizes, synchronizes and propagates isolation
cpumask changes, perform the work from that subsystem for consolidation
and consistency.
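After this change the update path roughly looks like this (sketch):

    cpuset partition update
      -> housekeeping_update(isolated_cpus, HK_TYPE_DOMAIN)
           -> workqueue_unbound_exclude_cpumask(housekeeping_cpumask(HK_TYPE_DOMAIN))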
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/workqueue.h | 2 +-
init/Kconfig | 1 +
kernel/cgroup/cpuset.c | 14 ++++++--------
kernel/sched/isolation.c | 4 +++-
kernel/workqueue.c | 2 +-
5 files changed, 12 insertions(+), 11 deletions(-)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 45d5dd470ff6..19fee865ce2a 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -588,7 +588,7 @@ struct workqueue_attrs *alloc_workqueue_attrs_noprof(void);
void free_workqueue_attrs(struct workqueue_attrs *attrs);
int apply_workqueue_attrs(struct workqueue_struct *wq,
const struct workqueue_attrs *attrs);
-extern int workqueue_unbound_exclude_cpumask(cpumask_var_t cpumask);
+extern int workqueue_unbound_exclude_cpumask(const struct cpumask *cpumask);
extern bool queue_work_on(int cpu, struct workqueue_struct *wq,
struct work_struct *work);
diff --git a/init/Kconfig b/init/Kconfig
index 836320251219..af05cf89db12 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1230,6 +1230,7 @@ config CPUSETS
bool "Cpuset controller"
depends on SMP
select UNION_FIND
+ select CPU_ISOLATION
help
This option will let you create and manage CPUSETs which
allow dynamically partitioning a system into sets of CPUs and
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 4f2bc68332a7..eb8d01d23af6 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1340,7 +1340,7 @@ static bool partition_xcpus_del(int old_prs, struct cpuset *parent,
return isolcpus_updated;
}
-static void update_unbound_workqueue_cpumask(bool isolcpus_updated)
+static void update_housekeeping_cpumask(bool isolcpus_updated)
{
int ret;
@@ -1349,8 +1349,6 @@ static void update_unbound_workqueue_cpumask(bool isolcpus_updated)
if (!isolcpus_updated)
return;
- ret = workqueue_unbound_exclude_cpumask(isolated_cpus);
- WARN_ON_ONCE(ret < 0);
ret = housekeeping_update(isolated_cpus, HK_TYPE_DOMAIN);
WARN_ON_ONCE(ret < 0);
}
@@ -1473,7 +1471,7 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
list_add(&cs->remote_sibling, &remote_children);
cpumask_copy(cs->effective_xcpus, tmp->new_cpus);
spin_unlock_irq(&callback_lock);
- update_unbound_workqueue_cpumask(isolcpus_updated);
+ update_housekeeping_cpumask(isolcpus_updated);
cpuset_force_rebuild();
cs->prs_err = 0;
@@ -1514,7 +1512,7 @@ static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
compute_effective_exclusive_cpumask(cs, NULL, NULL);
reset_partition_data(cs);
spin_unlock_irq(&callback_lock);
- update_unbound_workqueue_cpumask(isolcpus_updated);
+ update_housekeeping_cpumask(isolcpus_updated);
cpuset_force_rebuild();
/*
@@ -1583,7 +1581,7 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
if (xcpus)
cpumask_copy(cs->exclusive_cpus, xcpus);
spin_unlock_irq(&callback_lock);
- update_unbound_workqueue_cpumask(isolcpus_updated);
+ update_housekeeping_cpumask(isolcpus_updated);
if (adding || deleting)
cpuset_force_rebuild();
@@ -1947,7 +1945,7 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
WARN_ON_ONCE(parent->nr_subparts < 0);
}
spin_unlock_irq(&callback_lock);
- update_unbound_workqueue_cpumask(isolcpus_updated);
+ update_housekeeping_cpumask(isolcpus_updated);
if ((old_prs != new_prs) && (cmd == partcmd_update))
update_partition_exclusive_flag(cs, new_prs);
@@ -2972,7 +2970,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
else if (isolcpus_updated)
isolated_cpus_update(old_prs, new_prs, cs->effective_xcpus);
spin_unlock_irq(&callback_lock);
- update_unbound_workqueue_cpumask(isolcpus_updated);
+ update_housekeeping_cpumask(isolcpus_updated);
/* Force update if switching back to member & update effective_xcpus */
update_cpumasks_hier(cs, &tmpmask, !new_prs);
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 86ce39aa1e9f..5baf1621a56e 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -102,6 +102,7 @@ EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
int housekeeping_update(struct cpumask *mask, enum hk_type type)
{
struct cpumask *trial, *old = NULL;
+ int err;
if (type != HK_TYPE_DOMAIN)
return -ENOTSUPP;
@@ -126,10 +127,11 @@ int housekeeping_update(struct cpumask *mask, enum hk_type type)
mem_cgroup_flush_workqueue();
vmstat_flush_workqueue();
+ err = workqueue_unbound_exclude_cpumask(housekeeping_cpumask(type));
kfree(old);
- return 0;
+ return err;
}
void __init housekeeping_init(void)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c6b79b3675c3..63dcc1d8b317 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -6921,7 +6921,7 @@ static int workqueue_apply_unbound_cpumask(const cpumask_var_t unbound_cpumask)
* This function can be called from cpuset code to provide a set of isolated
* CPUs that should be excluded from wq_unbound_cpumask.
*/
-int workqueue_unbound_exclude_cpumask(cpumask_var_t exclude_cpumask)
+int workqueue_unbound_exclude_cpumask(const struct cpumask *exclude_cpumask)
{
cpumask_var_t cpumask;
int ret = 0;
--
2.51.0
* [PATCH 18/33] cpuset: Remove cpuset_cpu_is_isolated()
[not found] <20250829154814.47015-1-frederic@kernel.org>
` (4 preceding siblings ...)
2025-08-29 15:47 ` [PATCH 17/33] cpuset: Propagate cpuset isolation update to workqueue through housekeeping Frederic Weisbecker
@ 2025-08-29 15:47 ` Frederic Weisbecker
2025-08-29 15:48 ` [PATCH 26/33] cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping Frederic Weisbecker
2025-08-29 15:48 ` [PATCH 28/33] kthread: Honour kthreads preferred affinity after cpuset changes Frederic Weisbecker
7 siblings, 0 replies; 11+ messages in thread
From: Frederic Weisbecker @ 2025-08-29 15:47 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Johannes Weiner,
Marco Crivellari, Michal Hocko, Peter Zijlstra, Tejun Heo,
Thomas Gleixner, Vlastimil Babka, Waiman Long, cgroups
The set of cpuset isolated CPUs is now included in the HK_TYPE_DOMAIN
housekeeping cpumask. There is no use case left interested in just
checking what is isolated by cpuset and not by the isolcpus= kernel
boot parameter.
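Callers keep using the generic helper, which now covers boot and
runtime isolation alike (illustrative use):

    /* True if isolated by isolcpus=, nohz_full= or a cpuset partition */
    if (cpu_is_isolated(cpu))
            return;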
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/cpuset.h | 6 ------
include/linux/sched/isolation.h | 3 +--
kernel/cgroup/cpuset.c | 12 ------------
3 files changed, 1 insertion(+), 20 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 051d36fec578..a10775a4f702 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -78,7 +78,6 @@ extern void cpuset_lock(void);
extern void cpuset_unlock(void);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
-extern bool cpuset_cpu_is_isolated(int cpu);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
#define cpuset_current_mems_allowed (current->mems_allowed)
void cpuset_init_current_mems_allowed(void);
@@ -208,11 +207,6 @@ static inline bool cpuset_cpus_allowed_fallback(struct task_struct *p)
return false;
}
-static inline bool cpuset_cpu_is_isolated(int cpu)
-{
- return false;
-}
-
static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
{
return node_possible_map;
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index 199d0fc4646f..c02923ed4cbe 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -83,8 +83,7 @@ static inline void housekeeping_init(void) { }
static inline bool cpu_is_isolated(int cpu)
{
return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN) ||
- !housekeeping_test_cpu(cpu, HK_TYPE_TICK) ||
- cpuset_cpu_is_isolated(cpu);
+ !housekeeping_test_cpu(cpu, HK_TYPE_TICK);
}
#endif /* _LINUX_SCHED_ISOLATION_H */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index eb8d01d23af6..df1dfacf5f9d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -29,7 +29,6 @@
#include <linux/mempolicy.h>
#include <linux/mm.h>
#include <linux/memory.h>
-#include <linux/export.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/sched/deadline.h>
@@ -1353,17 +1352,6 @@ static void update_housekeeping_cpumask(bool isolcpus_updated)
WARN_ON_ONCE(ret < 0);
}
-/**
- * cpuset_cpu_is_isolated - Check if the given CPU is isolated
- * @cpu: the CPU number to be checked
- * Return: true if CPU is used in an isolated partition, false otherwise
- */
-bool cpuset_cpu_is_isolated(int cpu)
-{
- return cpumask_test_cpu(cpu, isolated_cpus);
-}
-EXPORT_SYMBOL_GPL(cpuset_cpu_is_isolated);
-
/*
* compute_effective_exclusive_cpumask - compute effective exclusive CPUs
* @cs: cpuset
--
2.51.0
* [PATCH 26/33] cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping
[not found] <20250829154814.47015-1-frederic@kernel.org>
` (5 preceding siblings ...)
2025-08-29 15:47 ` [PATCH 18/33] cpuset: Remove cpuset_cpu_is_isolated() Frederic Weisbecker
@ 2025-08-29 15:48 ` Frederic Weisbecker
2025-09-02 15:44 ` Waiman Long
2025-08-29 15:48 ` [PATCH 28/33] kthread: Honour kthreads preferred affinity after cpuset changes Frederic Weisbecker
7 siblings, 1 reply; 11+ messages in thread
From: Frederic Weisbecker @ 2025-08-29 15:48 UTC (permalink / raw)
To: LKML
Cc: Gabriele Monaco, Johannes Weiner, Marco Crivellari, Michal Hocko,
Michal Koutný, Peter Zijlstra, Tejun Heo, Thomas Gleixner,
Waiman Long, cgroups, Frederic Weisbecker
From: Gabriele Monaco <gmonaco@redhat.com>
Currently the user can set up isolated cpus via cpuset and nohz_full in
such a way that leaves no housekeeping CPU (i.e. no CPU that is neither
domain isolated nor nohz full). This can be a problem for other
subsystems (e.g. the timer wheel migration).
Prevent this configuration by blocking any assignment that would cause
the union of domain-isolated CPUs and nohz_full to cover all CPUs.
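As a worked example (hypothetical 4-CPU machine booted with
nohz_full=1-3), isolating CPU 0 through a cpuset partition must now
fail:

    HK_TYPE_KERNEL_NOISE:  0b0001  (nohz_full=1-3)
    HK_TYPE_DOMAIN:        0b1110  (cpuset isolates CPU 0)
    intersection:          0b0000  -> no housekeeping CPU left, rejected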
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/cgroup/cpuset.c | 57 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 57 insertions(+)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index df1dfacf5f9d..8260dd699fd8 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1275,6 +1275,19 @@ static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus
cpumask_andnot(isolated_cpus, isolated_cpus, xcpus);
}
+/*
+ * isolated_cpus_should_update - Returns if the isolated_cpus mask needs update
+ * @prs: new or old partition_root_state
+ * @parent: parent cpuset
+ * Return: true if isolated_cpus needs modification, false otherwise
+ */
+static bool isolated_cpus_should_update(int prs, struct cpuset *parent)
+{
+ if (!parent)
+ parent = &top_cpuset;
+ return prs != parent->partition_root_state;
+}
+
/*
* partition_xcpus_add - Add new exclusive CPUs to partition
* @new_prs: new partition_root_state
@@ -1339,6 +1352,36 @@ static bool partition_xcpus_del(int old_prs, struct cpuset *parent,
return isolcpus_updated;
}
+/*
+ * isolcpus_nohz_conflict - check for isolated & nohz_full conflicts
+ * @new_cpus: cpu mask for cpus that are going to be isolated
+ * Return: true if there is conflict, false otherwise
+ *
+ * If nohz_full is enabled and we have isolated CPUs, their combination must
+ * still leave housekeeping CPUs.
+ */
+static bool isolcpus_nohz_conflict(struct cpumask *new_cpus)
+{
+ cpumask_var_t full_hk_cpus;
+ int res = false;
+
+ if (!housekeeping_enabled(HK_TYPE_KERNEL_NOISE))
+ return false;
+
+ if (!alloc_cpumask_var(&full_hk_cpus, GFP_KERNEL))
+ return true;
+
+ cpumask_and(full_hk_cpus, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE),
+ housekeeping_cpumask(HK_TYPE_DOMAIN));
+ cpumask_andnot(full_hk_cpus, full_hk_cpus, isolated_cpus);
+ cpumask_and(full_hk_cpus, full_hk_cpus, cpu_online_mask);
+ if (!cpumask_weight_andnot(full_hk_cpus, new_cpus))
+ res = true;
+
+ free_cpumask_var(full_hk_cpus);
+ return res;
+}
+
static void update_housekeeping_cpumask(bool isolcpus_updated)
{
int ret;
@@ -1453,6 +1496,9 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
if (!cpumask_intersects(tmp->new_cpus, cpu_active_mask) ||
cpumask_subset(top_cpuset.effective_cpus, tmp->new_cpus))
return PERR_INVCPUS;
+ if (isolated_cpus_should_update(new_prs, NULL) &&
+ isolcpus_nohz_conflict(tmp->new_cpus))
+ return PERR_HKEEPING;
spin_lock_irq(&callback_lock);
isolcpus_updated = partition_xcpus_add(new_prs, NULL, tmp->new_cpus);
@@ -1552,6 +1598,9 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
else if (cpumask_intersects(tmp->addmask, subpartitions_cpus) ||
cpumask_subset(top_cpuset.effective_cpus, tmp->addmask))
cs->prs_err = PERR_NOCPUS;
+ else if (isolated_cpus_should_update(prs, NULL) &&
+ isolcpus_nohz_conflict(tmp->addmask))
+ cs->prs_err = PERR_HKEEPING;
if (cs->prs_err)
goto invalidate;
}
@@ -1904,6 +1953,12 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
return err;
}
+ if (deleting && isolated_cpus_should_update(new_prs, parent) &&
+ isolcpus_nohz_conflict(tmp->delmask)) {
+ cs->prs_err = PERR_HKEEPING;
+ return PERR_HKEEPING;
+ }
+
/*
* Change the parent's effective_cpus & effective_xcpus (top cpuset
* only).
@@ -2924,6 +2979,8 @@ static int update_prstate(struct cpuset *cs, int new_prs)
* Need to update isolated_cpus.
*/
isolcpus_updated = true;
+ if (isolcpus_nohz_conflict(cs->effective_xcpus))
+ err = PERR_HKEEPING;
} else {
/*
* Switching back to member is always allowed even if it
--
2.51.0
* [PATCH 28/33] kthread: Honour kthreads preferred affinity after cpuset changes
[not found] <20250829154814.47015-1-frederic@kernel.org>
` (6 preceding siblings ...)
2025-08-29 15:48 ` [PATCH 26/33] cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping Frederic Weisbecker
@ 2025-08-29 15:48 ` Frederic Weisbecker
7 siblings, 0 replies; 11+ messages in thread
From: Frederic Weisbecker @ 2025-08-29 15:48 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Ingo Molnar, Johannes Weiner,
Marco Crivellari, Michal Hocko, Michal Koutný,
Peter Zijlstra, Tejun Heo, Thomas Gleixner, Vlastimil Babka,
Waiman Long, cgroups
When cpuset isolated partitions get updated, unbound kthreads get
indiscriminately affined to all non-isolated CPUs, regardless of their
individual affinity preferences.
For example kswapd is a per-node kthread that prefers to be affined to
the node it serves. Whenever an isolated partition is created, updated
or deleted, kswapd's node affinity gets broken if any CPU in the related
node is not isolated, because kswapd is then reaffined globally.
Fix this by letting the consolidated kthread affinity management code
perform the affinity update on behalf of cpuset.
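For context, a per-node kthread registers its preference as in this
sketch (mykthread_fn, data and nid are hypothetical):

    struct task_struct *t;

    t = kthread_create_on_node(mykthread_fn, data, nid, "mykthread/%d", nid);
    kthread_affine_preferred(t, cpumask_of_node(nid));
    wake_up_process(t);

With this patch, cpuset isolation updates reach
kthreads_update_housekeeping(), which re-applies such preferences
instead of blindly affining the kthread to all non-isolated CPUs.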
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/kthread.h | 1 +
kernel/cgroup/cpuset.c | 5 ++---
kernel/kthread.c | 38 +++++++++++++++++++++++++++++---------
kernel/sched/isolation.c | 2 ++
4 files changed, 34 insertions(+), 12 deletions(-)
diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 8d27403888ce..c92c1149ee6e 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -100,6 +100,7 @@ void kthread_unpark(struct task_struct *k);
void kthread_parkme(void);
void kthread_exit(long result) __noreturn;
void kthread_complete_and_exit(struct completion *, long) __noreturn;
+int kthreads_update_housekeeping(void);
int kthreadd(void *unused);
extern struct task_struct *kthreadd_task;
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index cf99ea844c1d..e76711fa7d34 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1130,11 +1130,10 @@ void cpuset_update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
if (top_cs) {
/*
+ * PF_KTHREAD tasks are handled by housekeeping.
* PF_NO_SETAFFINITY tasks are ignored.
- * All per cpu kthreads should have PF_NO_SETAFFINITY
- * flag set, see kthread_set_per_cpu().
*/
- if (task->flags & PF_NO_SETAFFINITY)
+ if (task->flags & (PF_KTHREAD | PF_NO_SETAFFINITY))
continue;
cpumask_andnot(new_cpus, possible_mask, subpartitions_cpus);
} else {
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 8d0c8c4c7e46..4d3cc04e5e8b 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -896,14 +896,7 @@ int kthread_affine_preferred(struct task_struct *p, const struct cpumask *mask)
}
EXPORT_SYMBOL_GPL(kthread_affine_preferred);
-/*
- * Re-affine kthreads according to their preferences
- * and the newly online CPU. The CPU down part is handled
- * by select_fallback_rq() which default re-affines to
- * housekeepers from other nodes in case the preferred
- * affinity doesn't apply anymore.
- */
-static int kthreads_online_cpu(unsigned int cpu)
+static int kthreads_update_affinity(bool force)
{
cpumask_var_t affinity;
struct kthread *k;
@@ -926,7 +919,7 @@ static int kthreads_online_cpu(unsigned int cpu)
continue;
}
- if (k->preferred_affinity || k->node != NUMA_NO_NODE) {
+ if (force || k->preferred_affinity || k->node != NUMA_NO_NODE) {
kthread_fetch_affinity(k, affinity);
set_cpus_allowed_ptr(k->task, affinity);
}
@@ -937,6 +930,33 @@ static int kthreads_online_cpu(unsigned int cpu)
return ret;
}
+/**
+ * kthreads_update_housekeeping - Update kthreads affinity on cpuset change
+ *
+ * When cpuset changes a partition type to/from "isolated" or updates related
+ * cpumasks, propagate the housekeeping cpumask change to preferred kthreads
+ * affinity.
+ *
+ * Returns 0 if successful, -ENOMEM if temporary mask couldn't
+ * be allocated or -EINVAL in case of internal error.
+ */
+int kthreads_update_housekeeping(void)
+{
+ return kthreads_update_affinity(true);
+}
+
+/*
+ * Re-affine kthreads according to their preferences
+ * and the newly online CPU. The CPU down part is handled
+ * by select_fallback_rq() which default re-affines to
+ * housekeepers from other nodes in case the preferred
+ * affinity doesn't apply anymore.
+ */
+static int kthreads_online_cpu(unsigned int cpu)
+{
+ return kthreads_update_affinity(false);
+}
+
static int kthreads_init(void)
{
return cpuhp_setup_state(CPUHP_AP_KTHREADS_ONLINE, "kthreads:online",
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 5baf1621a56e..51392eb9b221 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -128,6 +128,8 @@ int housekeeping_update(struct cpumask *mask, enum hk_type type)
mem_cgroup_flush_workqueue();
vmstat_flush_workqueue();
err = workqueue_unbound_exclude_cpumask(housekeeping_cpumask(type));
+ WARN_ON_ONCE(err < 0);
+ err = kthreads_update_housekeeping();
kfree(old);
--
2.51.0
* Re: [PATCH 14/33] cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
2025-08-29 15:47 ` [PATCH 14/33] cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset Frederic Weisbecker
@ 2025-09-01 0:40 ` Waiman Long
0 siblings, 0 replies; 11+ messages in thread
From: Waiman Long @ 2025-09-01 0:40 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Michal Koutný, Ingo Molnar, Johannes Weiner,
Marco Crivellari, Michal Hocko, Peter Zijlstra, Tejun Heo,
Thomas Gleixner, Vlastimil Babka, cgroups
On 8/29/25 11:47 AM, Frederic Weisbecker wrote:
> Until now, HK_TYPE_DOMAIN only included the boot-defined isolated CPUs
> passed through the isolcpus= boot option. Users also interested in
> knowing the runtime-defined isolated CPUs through cpuset must use
> different APIs: cpuset_cpu_is_isolated(), cpu_is_isolated(), etc.
>
> There are many drawbacks to that approach:
>
> 1) Most interested subsystems want to know about all isolated CPUs, not
> just those defined at boot time.
>
> 2) cpuset_cpu_is_isolated() / cpu_is_isolated() are not synchronized with
> concurrent cpuset changes.
>
> 3) Further cpuset modifications are not propagated to subsystems.
>
> Solve 1) and 2) and centralize all isolated CPUs within the
> HK_TYPE_DOMAIN housekeeping cpumask.
>
> Subsystems can rely on RCU to synchronize against concurrent changes.
>
> The propagation mentioned in 3) will be handled in further patches.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
> include/linux/sched/isolation.h | 4 +-
> kernel/cgroup/cpuset.c | 2 +
> kernel/sched/isolation.c | 65 ++++++++++++++++++++++++++++++---
> kernel/sched/sched.h | 1 +
> 4 files changed, 65 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 5ddb8dc5ca91..48f3b6b20604 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -23,16 +23,39 @@ EXPORT_SYMBOL_GPL(housekeeping_flags);
>
> bool housekeeping_enabled(enum hk_type type)
> {
> - return !!(housekeeping_flags & BIT(type));
> + return !!(READ_ONCE(housekeeping_flags) & BIT(type));
> }
> EXPORT_SYMBOL_GPL(housekeeping_enabled);
>
> +static bool housekeeping_dereference_check(enum hk_type type)
> +{
> + if (type == HK_TYPE_DOMAIN) {
> + if (IS_ENABLED(CONFIG_HOTPLUG_CPU) && lockdep_is_cpus_write_held())
> + return true;
> + if (IS_ENABLED(CONFIG_CPUSETS) && lockdep_is_cpuset_held())
> + return true;
> +
> + return false;
> + }
> +
> + return true;
> +}
Both lockdep_is_cpuset_held() and lockdep_is_cpus_write_held() may be
defined only if CONFIG_LOCKDEP is set. However, this function is
currently referenced by __housekeeping_cpumask() via RCU_LOCKDEP_WARN().
So it is not invoked if CONFIG_LOCKDEP is not set. You are assuming that
a static function that is never referenced is not compiled into the
object file. Should we bracket it with "#ifdef CONFIG_LOCKDEP" just to
make this clear?
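I.e. something like the following (untested sketch; the !LOCKDEP stub
keeps the RCU_LOCKDEP_WARN() reference compiling):

    #ifdef CONFIG_LOCKDEP
    static bool housekeeping_dereference_check(enum hk_type type)
    {
            /* ... lockdep checks as above ... */
    }
    #else
    static inline bool housekeeping_dereference_check(enum hk_type type)
    {
            return true;
    }
    #endif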
> +
> +static inline struct cpumask *__housekeeping_cpumask(enum hk_type type)
> +{
> + return rcu_dereference_check(housekeeping_cpumasks[type],
> + housekeeping_dereference_check(type));
> +}
> +
> const struct cpumask *housekeeping_cpumask(enum hk_type type)
> {
> - if (housekeeping_flags & BIT(type)) {
> - return rcu_dereference_check(housekeeping_cpumasks[type], 1);
> - }
> - return cpu_possible_mask;
> + const struct cpumask *mask = NULL;
> +
> + if (READ_ONCE(housekeeping_flags) & BIT(type))
> + mask = __housekeeping_cpumask(type);
> + if (!mask)
> + mask = cpu_possible_mask;
> + return mask;
> }
> EXPORT_SYMBOL_GPL(housekeeping_cpumask);
>
> @@ -70,12 +93,42 @@ EXPORT_SYMBOL_GPL(housekeeping_affine);
>
> bool housekeeping_test_cpu(int cpu, enum hk_type type)
> {
> - if (housekeeping_flags & BIT(type))
> + if (READ_ONCE(housekeeping_flags) & BIT(type))
> return cpumask_test_cpu(cpu, housekeeping_cpumask(type));
> return true;
> }
> EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
>
> +int housekeeping_update(struct cpumask *mask, enum hk_type type)
> +{
> + struct cpumask *trial, *old = NULL;
> +
> + if (type != HK_TYPE_DOMAIN)
> + return -ENOTSUPP;
> +
> + trial = kmalloc(sizeof(*trial), GFP_KERNEL);
> + if (!trial)
> + return -ENOMEM;
> +
> + cpumask_andnot(trial, housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT), mask);
> + if (!cpumask_intersects(trial, cpu_online_mask)) {
> + kfree(trial);
> + return -EINVAL;
> + }
> +
> + if (housekeeping_flags & BIT(type))
> + old = __housekeeping_cpumask(type);
> + else
> + WRITE_ONCE(housekeeping_flags, housekeeping_flags | BIT(type));
Should we use READ_ONCE() to retrieve the current housekeeping_flags
value?
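I.e. (untested sketch):

    unsigned long flags = READ_ONCE(housekeeping_flags);

    if (flags & BIT(type))
            old = __housekeeping_cpumask(type);
    else
            WRITE_ONCE(housekeeping_flags, flags | BIT(type));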
Cheers,
Longman
* Re: [PATCH 17/33] cpuset: Propagate cpuset isolation update to workqueue through housekeeping
2025-08-29 15:47 ` [PATCH 17/33] cpuset: Propagate cpuset isolation update to workqueue through housekeeping Frederic Weisbecker
@ 2025-09-01 2:51 ` Waiman Long
0 siblings, 0 replies; 11+ messages in thread
From: Waiman Long @ 2025-09-01 2:51 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Michal Koutný, Ingo Molnar, Johannes Weiner, Lai Jiangshan,
Marco Crivellari, Michal Hocko, Peter Zijlstra, Tejun Heo,
Thomas Gleixner, Vlastimil Babka, cgroups
On 8/29/25 11:47 AM, Frederic Weisbecker wrote:
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -102,6 +102,7 @@ EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
> int housekeeping_update(struct cpumask *mask, enum hk_type type)
> {
> struct cpumask *trial, *old = NULL;
> + int err;
>
> if (type != HK_TYPE_DOMAIN)
> return -ENOTSUPP;
> @@ -126,10 +127,11 @@ int housekeeping_update(struct cpumask *mask, enum hk_type type)
>
> mem_cgroup_flush_workqueue();
> vmstat_flush_workqueue();
> + err = workqueue_unbound_exclude_cpumask(housekeeping_cpumask(type));
>
> kfree(old);
>
> - return 0;
> + return err;
> }
Actually workqueue_unbound_exclude_cpumask() expects a cpumask of all
the CPUs that have been isolated. IOW, all the CPUs that are not in
housekeeping_cpumask(HK_TYPE_DOMAIN). So we should either do the
inversion here, or rename the function to, e.g.
workqueue_unbound_cpumask_update(), and make the change there.
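The inversion variant would be something like (untested sketch):

    cpumask_var_t isolated;

    if (!alloc_cpumask_var(&isolated, GFP_KERNEL))
            return -ENOMEM;
    cpumask_andnot(isolated, cpu_possible_mask, housekeeping_cpumask(type));
    err = workqueue_unbound_exclude_cpumask(isolated);
    free_cpumask_var(isolated);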
Cheers,
Longman
* Re: [PATCH 26/33] cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping
2025-08-29 15:48 ` [PATCH 26/33] cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping Frederic Weisbecker
@ 2025-09-02 15:44 ` Waiman Long
0 siblings, 0 replies; 11+ messages in thread
From: Waiman Long @ 2025-09-02 15:44 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Gabriele Monaco, Johannes Weiner, Marco Crivellari, Michal Hocko,
Michal Koutný, Peter Zijlstra, Tejun Heo, Thomas Gleixner,
cgroups
On 8/29/25 11:48 AM, Frederic Weisbecker wrote:
> From: Gabriele Monaco <gmonaco@redhat.com>
>
> Currently the user can set up isolated cpus via cpuset and nohz_full in
> such a way that leaves no housekeeping CPU (i.e. no CPU that is neither
> domain isolated nor nohz full). This can be a problem for other
> subsystems (e.g. the timer wheel migration).
>
> Prevent this configuration by blocking any assignment that would cause
> the union of domain-isolated CPUs and nohz_full to cover all CPUs.
>
> Acked-by: Frederic Weisbecker <frederic@kernel.org>
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
> kernel/cgroup/cpuset.c | 57 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 57 insertions(+)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index df1dfacf5f9d..8260dd699fd8 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1275,6 +1275,19 @@ static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus
> cpumask_andnot(isolated_cpus, isolated_cpus, xcpus);
> }
>
> +/*
> + * isolated_cpus_should_update - Returns if the isolated_cpus mask needs update
> + * @prs: new or old partition_root_state
> + * @parent: parent cpuset
> + * Return: true if isolated_cpus needs modification, false otherwise
> + */
> +static bool isolated_cpus_should_update(int prs, struct cpuset *parent)
> +{
> + if (!parent)
> + parent = &top_cpuset;
> + return prs != parent->partition_root_state;
> +}
> +
> /*
> * partition_xcpus_add - Add new exclusive CPUs to partition
> * @new_prs: new partition_root_state
> @@ -1339,6 +1352,36 @@ static bool partition_xcpus_del(int old_prs, struct cpuset *parent,
> return isolcpus_updated;
> }
>
> +/*
> + * isolcpus_nohz_conflict - check for isolated & nohz_full conflicts
> + * @new_cpus: cpu mask for cpus that are going to be isolated
> + * Return: true if there is conflict, false otherwise
> + *
> + * If nohz_full is enabled and we have isolated CPUs, their combination must
> + * still leave housekeeping CPUs.
> + */
> +static bool isolcpus_nohz_conflict(struct cpumask *new_cpus)
> +{
> + cpumask_var_t full_hk_cpus;
> + int res = false;
> +
> + if (!housekeeping_enabled(HK_TYPE_KERNEL_NOISE))
> + return false;
> +
> + if (!alloc_cpumask_var(&full_hk_cpus, GFP_KERNEL))
> + return true;
> +
> + cpumask_and(full_hk_cpus, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE),
> + housekeeping_cpumask(HK_TYPE_DOMAIN));
> + cpumask_andnot(full_hk_cpus, full_hk_cpus, isolated_cpus);
> + cpumask_and(full_hk_cpus, full_hk_cpus, cpu_online_mask);
> + if (!cpumask_weight_andnot(full_hk_cpus, new_cpus))
> + res = true;
> +
> + free_cpumask_var(full_hk_cpus);
> + return res;
> +}
> +
> static void update_housekeeping_cpumask(bool isolcpus_updated)
> {
> int ret;
> @@ -1453,6 +1496,9 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
> if (!cpumask_intersects(tmp->new_cpus, cpu_active_mask) ||
> cpumask_subset(top_cpuset.effective_cpus, tmp->new_cpus))
> return PERR_INVCPUS;
> + if (isolated_cpus_should_update(new_prs, NULL) &&
> + isolcpus_nohz_conflict(tmp->new_cpus))
> + return PERR_HKEEPING;
>
> spin_lock_irq(&callback_lock);
> isolcpus_updated = partition_xcpus_add(new_prs, NULL, tmp->new_cpus);
> @@ -1552,6 +1598,9 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
> else if (cpumask_intersects(tmp->addmask, subpartitions_cpus) ||
> cpumask_subset(top_cpuset.effective_cpus, tmp->addmask))
> cs->prs_err = PERR_NOCPUS;
> + else if (isolated_cpus_should_update(prs, NULL) &&
> + isolcpus_nohz_conflict(tmp->addmask))
> + cs->prs_err = PERR_HKEEPING;
> if (cs->prs_err)
> goto invalidate;
> }
> @@ -1904,6 +1953,12 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
> return err;
> }
>
> + if (deleting && isolated_cpus_should_update(new_prs, parent) &&
> + isolcpus_nohz_conflict(tmp->delmask)) {
> + cs->prs_err = PERR_HKEEPING;
> + return PERR_HKEEPING;
> + }
> +
> /*
> * Change the parent's effective_cpus & effective_xcpus (top cpuset
> * only).
> @@ -2924,6 +2979,8 @@ static int update_prstate(struct cpuset *cs, int new_prs)
> * Need to update isolated_cpus.
> */
> isolcpus_updated = true;
> + if (isolcpus_nohz_conflict(cs->effective_xcpus))
> + err = PERR_HKEEPING;
> } else {
> /*
> * Switching back to member is always allowed even if it
In both remote_cpus_update() and update_parent_effective_cpumask(), some
new CPUs can be added to the isolation list while other CPUs can be
removed from it. So isolcpus_nohz_conflict() should include both sets in
its analysis to avoid false positives. Essentially, if the CPUs removed
from isolated_cpus intersect with the nohz_full housekeeping mask,
there is no conflict.
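I.e. a hypothetical two-mask variant:

    static bool isolcpus_nohz_conflict(struct cpumask *add_cpus,
                                       struct cpumask *del_cpus)

where the CPUs in del_cpus get removed from the candidate isolated set
before checking whether any housekeeping CPU remains.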
Cheers,
Longman