* [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity
@ 2025-06-20 15:22 Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 01/27] sched/isolation: Remove housekeeping static key Frederic Weisbecker
` (27 more replies)
0 siblings, 28 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Marco Crivellari, Michal Hocko,
Peter Zijlstra, Tejun Heo, Thomas Gleixner, Vlastimil Babka,
Waiman Long
The kthread code was recently enhanced to provide an infrastructure which
manages the preferred affinity of unbound kthreads (node or custom
cpumask) against housekeeping constraints and CPU hotplug events.

One crucial piece is missing: cpuset. When an isolated partition is
created, deleted, or has its CPUs updated, all the unbound kthreads in
the top cpuset are affined to _all_ the non-isolated CPUs, possibly
breaking their preferred affinity along the way.

Solve this by performing the kthreads affinity update from cpuset
through the consolidated relevant kthread code instead, so that
preferred affinities are honoured.
The dispatch of the new cpumasks to workqueues and kthreads is performed
by housekeeping, as per Tejun's nice suggestion.

As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
from isolcpus= and cpuset isolated partitions. Housekeeping cpumasks are
now modifiable with specific synchronization. This is a big step toward
also making nohz_full= mutable through cpuset in the future.
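
For illustration (not part of the series itself), a subsystem wanting to
act on the combined isolated set would then follow a pattern like this,
where wq and work are placeholders:

	/* Sketch: queue work on @cpu unless it is, or just became, isolated */
	housekeeping_lock();
	if (housekeeping_cpu(cpu, HK_TYPE_DOMAIN))
		queue_work_on(cpu, wq, &work);
	housekeeping_unlock();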
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
kthread/core
HEAD: f43c8b542df665940c2f581d771d92ff50606a6e
Thanks,
Frederic
---
Frederic Weisbecker (27):
sched/isolation: Remove housekeeping static key
sched/isolation: Introduce housekeeping per-cpu rwsem
PCI: Protect against concurrent change of housekeeping cpumask
cpu: Protect against concurrent isolated cpuset change
memcg: Prepare to protect against concurrent isolated cpuset change
mm: vmstat: Prepare to protect against concurrent isolated cpuset change
sched/isolation: Save boot defined domain flags
cpuset: Convert boot_hk_cpus to use HK_TYPE_DOMAIN_BOOT
driver core: cpu: Convert /sys/devices/system/cpu/isolated to use HK_TYPE_DOMAIN_BOOT
net: Keep ignoring isolated cpuset change
block: Protect against concurrent isolated cpuset change
cpu: Provide lockdep check for CPU hotplug lock write-held
cpuset: Provide lockdep check for cpuset lock held
sched/isolation: Convert housekeeping cpumasks to rcu pointers
cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
sched/isolation: Flush memcg workqueues on cpuset isolated partition change
sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
cpuset: Propagate cpuset isolation update to workqueue through housekeeping
cpuset: Remove cpuset_cpu_is_isolated()
sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated()
kthread: Refine naming of affinity related fields
kthread: Include unbound kthreads in the managed affinity list
kthread: Include kthreadd to the managed affinity list
kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management
sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN
kthread: Honour kthreads preferred affinity after cpuset changes
kthread: Comment on the purpose and placement of kthread_affine_node() call
block/blk-mq.c | 6 +-
drivers/base/cpu.c | 2 +-
drivers/pci/pci-driver.c | 3 +-
include/linux/cpuhplock.h | 1 +
include/linux/cpuset.h | 8 +-
include/linux/kthread.h | 1 +
include/linux/memcontrol.h | 4 +
include/linux/mmu_context.h | 2 +-
include/linux/percpu-rwsem.h | 1 +
include/linux/sched/isolation.h | 38 +++++---
include/linux/vmstat.h | 2 +
include/linux/workqueue.h | 2 +-
init/Kconfig | 1 +
kernel/cgroup/cpuset.c | 60 +++++-------
kernel/cpu.c | 49 +++++++---
kernel/kthread.c | 136 ++++++++++++++++-----------
kernel/sched/isolation.c | 201 ++++++++++++++++++++++++++++------------
kernel/sched/sched.h | 5 +
kernel/workqueue.c | 2 +-
mm/memcontrol.c | 26 +++++-
mm/vmstat.c | 15 ++-
net/core/net-sysfs.c | 2 +-
22 files changed, 371 insertions(+), 196 deletions(-)
* [PATCH 01/27] sched/isolation: Remove housekeeping static key
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
@ 2025-06-20 15:22 ` Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 02/27] sched/isolation: Introduce housekeeping per-cpu rwsem Frederic Weisbecker
` (26 subsequent siblings)
27 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Ingo Molnar, Marco Crivellari, Michal Hocko,
Peter Zijlstra, Tejun Heo, Thomas Gleixner, Vlastimil Babka,
Waiman Long
The housekeeping static key in its current use is mostly irrelevant.
Most of the time, a housekeeping function call has already been issued
before the static branch gets a chance to be evaluated, defeating the
purpose of the initial call optimization.

housekeeping_cpu() is the sole correct user, evaluating the static
branch before the actual slow-path function call. But it is seldom used
in fast paths.

Finally, the static key prevents correct synchronization against
dynamic updates of the housekeeping cpumasks through cpuset.

Get away with a simple flag test instead.
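
For illustration (example, not from the patch): a typical caller invokes
the out-of-line helper unconditionally, so the static branch inside it
was only evaluated once the call had already been made:

	/* The function call itself is unconditional; the static branch
	 * inside housekeeping_cpumask() couldn't save anything here. */
	cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_DOMAIN));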
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/sched/isolation.h | 25 +++++----
kernel/sched/isolation.c | 90 ++++++++++++++-------------------
2 files changed, 55 insertions(+), 60 deletions(-)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index d8501f4709b5..f98ba0d71c52 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -25,12 +25,22 @@ enum hk_type {
};
#ifdef CONFIG_CPU_ISOLATION
-DECLARE_STATIC_KEY_FALSE(housekeeping_overridden);
+extern unsigned long housekeeping_flags;
+
extern int housekeeping_any_cpu(enum hk_type type);
extern const struct cpumask *housekeeping_cpumask(enum hk_type type);
extern bool housekeeping_enabled(enum hk_type type);
extern void housekeeping_affine(struct task_struct *t, enum hk_type type);
extern bool housekeeping_test_cpu(int cpu, enum hk_type type);
+
+static inline bool housekeeping_cpu(int cpu, enum hk_type type)
+{
+ if (housekeeping_flags & BIT(type))
+ return housekeeping_test_cpu(cpu, type);
+ else
+ return true;
+}
+
extern void __init housekeeping_init(void);
#else
@@ -58,17 +68,14 @@ static inline bool housekeeping_test_cpu(int cpu, enum hk_type type)
return true;
}
+static inline bool housekeeping_cpu(int cpu, enum hk_type type)
+{
+ return true;
+}
+
static inline void housekeeping_init(void) { }
#endif /* CONFIG_CPU_ISOLATION */
-static inline bool housekeeping_cpu(int cpu, enum hk_type type)
-{
-#ifdef CONFIG_CPU_ISOLATION
- if (static_branch_unlikely(&housekeeping_overridden))
- return housekeeping_test_cpu(cpu, type);
-#endif
- return true;
-}
static inline bool cpu_is_isolated(int cpu)
{
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 93b038d48900..83cec3853864 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -14,19 +14,13 @@ enum hk_flags {
HK_FLAG_KERNEL_NOISE = BIT(HK_TYPE_KERNEL_NOISE),
};
-DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
-EXPORT_SYMBOL_GPL(housekeeping_overridden);
-
-struct housekeeping {
- cpumask_var_t cpumasks[HK_TYPE_MAX];
- unsigned long flags;
-};
-
-static struct housekeeping housekeeping;
+static cpumask_var_t housekeeping_cpumasks[HK_TYPE_MAX];
+unsigned long housekeeping_flags;
+EXPORT_SYMBOL_GPL(housekeeping_flags);
bool housekeeping_enabled(enum hk_type type)
{
- return !!(housekeeping.flags & BIT(type));
+ return !!(housekeeping_flags & BIT(type));
}
EXPORT_SYMBOL_GPL(housekeeping_enabled);
@@ -34,50 +28,46 @@ int housekeeping_any_cpu(enum hk_type type)
{
int cpu;
- if (static_branch_unlikely(&housekeeping_overridden)) {
- if (housekeeping.flags & BIT(type)) {
- cpu = sched_numa_find_closest(housekeeping.cpumasks[type], smp_processor_id());
- if (cpu < nr_cpu_ids)
- return cpu;
+ if (housekeeping_flags & BIT(type)) {
+ cpu = sched_numa_find_closest(housekeeping_cpumasks[type], smp_processor_id());
+ if (cpu < nr_cpu_ids)
+ return cpu;
- cpu = cpumask_any_and_distribute(housekeeping.cpumasks[type], cpu_online_mask);
- if (likely(cpu < nr_cpu_ids))
- return cpu;
- /*
- * Unless we have another problem this can only happen
- * at boot time before start_secondary() brings the 1st
- * housekeeping CPU up.
- */
- WARN_ON_ONCE(system_state == SYSTEM_RUNNING ||
- type != HK_TYPE_TIMER);
- }
+ cpu = cpumask_any_and_distribute(housekeeping_cpumasks[type], cpu_online_mask);
+ if (likely(cpu < nr_cpu_ids))
+ return cpu;
+ /*
+ * Unless we have another problem this can only happen
+ * at boot time before start_secondary() brings the 1st
+ * housekeeping CPU up.
+ */
+ WARN_ON_ONCE(system_state == SYSTEM_RUNNING ||
+ type != HK_TYPE_TIMER);
}
+
return smp_processor_id();
}
EXPORT_SYMBOL_GPL(housekeeping_any_cpu);
const struct cpumask *housekeeping_cpumask(enum hk_type type)
{
- if (static_branch_unlikely(&housekeeping_overridden))
- if (housekeeping.flags & BIT(type))
- return housekeeping.cpumasks[type];
+ if (housekeeping_flags & BIT(type))
+ return housekeeping_cpumasks[type];
return cpu_possible_mask;
}
EXPORT_SYMBOL_GPL(housekeeping_cpumask);
void housekeeping_affine(struct task_struct *t, enum hk_type type)
{
- if (static_branch_unlikely(&housekeeping_overridden))
- if (housekeeping.flags & BIT(type))
- set_cpus_allowed_ptr(t, housekeeping.cpumasks[type]);
+ if (housekeeping_flags & BIT(type))
+ set_cpus_allowed_ptr(t, housekeeping_cpumasks[type]);
}
EXPORT_SYMBOL_GPL(housekeeping_affine);
bool housekeeping_test_cpu(int cpu, enum hk_type type)
{
- if (static_branch_unlikely(&housekeeping_overridden))
- if (housekeeping.flags & BIT(type))
- return cpumask_test_cpu(cpu, housekeeping.cpumasks[type]);
+ if (housekeeping_flags & BIT(type))
+ return cpumask_test_cpu(cpu, housekeeping_cpumasks[type]);
return true;
}
EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
@@ -86,17 +76,15 @@ void __init housekeeping_init(void)
{
enum hk_type type;
- if (!housekeeping.flags)
+ if (!housekeeping_flags)
return;
- static_branch_enable(&housekeeping_overridden);
-
- if (housekeeping.flags & HK_FLAG_KERNEL_NOISE)
+ if (housekeeping_flags & HK_FLAG_KERNEL_NOISE)
sched_tick_offload_init();
- for_each_set_bit(type, &housekeeping.flags, HK_TYPE_MAX) {
+ for_each_set_bit(type, &housekeeping_flags, HK_TYPE_MAX) {
/* We need at least one CPU to handle housekeeping work */
- WARN_ON_ONCE(cpumask_empty(housekeeping.cpumasks[type]));
+ WARN_ON_ONCE(cpumask_empty(housekeeping_cpumasks[type]));
}
}
@@ -104,8 +92,8 @@ static void __init housekeeping_setup_type(enum hk_type type,
cpumask_var_t housekeeping_staging)
{
- alloc_bootmem_cpumask_var(&housekeeping.cpumasks[type]);
- cpumask_copy(housekeeping.cpumasks[type],
+ alloc_bootmem_cpumask_var(&housekeeping_cpumasks[type]);
+ cpumask_copy(housekeeping_cpumasks[type],
housekeeping_staging);
}
@@ -115,7 +103,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
unsigned int first_cpu;
int err = 0;
- if ((flags & HK_FLAG_KERNEL_NOISE) && !(housekeeping.flags & HK_FLAG_KERNEL_NOISE)) {
+ if ((flags & HK_FLAG_KERNEL_NOISE) && !(housekeeping_flags & HK_FLAG_KERNEL_NOISE)) {
if (!IS_ENABLED(CONFIG_NO_HZ_FULL)) {
pr_warn("Housekeeping: nohz unsupported."
" Build with CONFIG_NO_HZ_FULL\n");
@@ -137,7 +125,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
if (first_cpu >= nr_cpu_ids || first_cpu >= setup_max_cpus) {
__cpumask_set_cpu(smp_processor_id(), housekeeping_staging);
__cpumask_clear_cpu(smp_processor_id(), non_housekeeping_mask);
- if (!housekeeping.flags) {
+ if (!housekeeping_flags) {
pr_warn("Housekeeping: must include one present CPU, "
"using boot CPU:%d\n", smp_processor_id());
}
@@ -146,7 +134,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
if (cpumask_empty(non_housekeeping_mask))
goto free_housekeeping_staging;
- if (!housekeeping.flags) {
+ if (!housekeeping_flags) {
/* First setup call ("nohz_full=" or "isolcpus=") */
enum hk_type type;
@@ -155,26 +143,26 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
} else {
/* Second setup call ("nohz_full=" after "isolcpus=" or the reverse) */
enum hk_type type;
- unsigned long iter_flags = flags & housekeeping.flags;
+ unsigned long iter_flags = flags & housekeeping_flags;
for_each_set_bit(type, &iter_flags, HK_TYPE_MAX) {
if (!cpumask_equal(housekeeping_staging,
- housekeeping.cpumasks[type])) {
+ housekeeping_cpumasks[type])) {
pr_warn("Housekeeping: nohz_full= must match isolcpus=\n");
goto free_housekeeping_staging;
}
}
- iter_flags = flags & ~housekeeping.flags;
+ iter_flags = flags & ~housekeeping_flags;
for_each_set_bit(type, &iter_flags, HK_TYPE_MAX)
housekeeping_setup_type(type, housekeeping_staging);
}
- if ((flags & HK_FLAG_KERNEL_NOISE) && !(housekeeping.flags & HK_FLAG_KERNEL_NOISE))
+ if ((flags & HK_FLAG_KERNEL_NOISE) && !(housekeeping_flags & HK_FLAG_KERNEL_NOISE))
tick_nohz_full_setup(non_housekeeping_mask);
- housekeeping.flags |= flags;
+ housekeeping_flags |= flags;
err = 1;
free_housekeeping_staging:
--
2.48.1
* [PATCH 02/27] sched/isolation: Introduce housekeeping per-cpu rwsem
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 01/27] sched/isolation: Remove housekeeping static key Frederic Weisbecker
@ 2025-06-20 15:22 ` Frederic Weisbecker
2025-06-23 17:34 ` Waiman Long
2025-06-20 15:22 ` [PATCH 03/27] PCI: Protect against concurrent change of housekeeping cpumask Frederic Weisbecker
` (25 subsequent siblings)
27 siblings, 1 reply; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Ingo Molnar, Marco Crivellari, Michal Hocko,
Peter Zijlstra, Tejun Heo, Thomas Gleixner, Vlastimil Babka,
Waiman Long
The HK_TYPE_DOMAIN isolation cpumask, and later the HK_TYPE_KERNEL_NOISE
cpumask, will be made modifiable at runtime in the future.

The affected subsystems will need to synchronize against those cpumask
changes so that:

* Readers get a coherent snapshot

* The housekeeping subsystem can safely propagate a cpumask update to
  the subsystems after it has been published.

Protect the read sides that can sleep with a per-CPU rwsem. Updates are
expected to be very rare given that CPU isolation is a niche usecase and
the related cpuset setup only happens during preparation work. Read
sides, on the other hand, can occur in more frequent paths.
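
As a sketch of the intended usage (the write side is only introduced
later in the series):

	/* Sleepable read side, e.g. around queueing work on a CPU: */
	housekeeping_lock();
	if (!cpu_is_isolated(cpu))
		schedule_work_on(cpu, &work);
	housekeeping_unlock();

	/* Update side, in the housekeeping cpumask writer: */
	percpu_down_write(&housekeeping_pcpu_lock);
	/* ...publish the new cpumask... */
	percpu_up_write(&housekeeping_pcpu_lock);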
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/sched/isolation.h | 7 +++++++
kernel/sched/isolation.c | 12 ++++++++++++
kernel/sched/sched.h | 1 +
3 files changed, 20 insertions(+)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index f98ba0d71c52..8de4f625a5c1 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -41,6 +41,9 @@ static inline bool housekeeping_cpu(int cpu, enum hk_type type)
return true;
}
+extern void housekeeping_lock(void);
+extern void housekeeping_unlock(void);
+
extern void __init housekeeping_init(void);
#else
@@ -73,6 +76,8 @@ static inline bool housekeeping_cpu(int cpu, enum hk_type type)
return true;
}
+static inline void housekeeping_lock(void) { }
+static inline void housekeeping_unlock(void) { }
static inline void housekeeping_init(void) { }
#endif /* CONFIG_CPU_ISOLATION */
@@ -84,4 +89,6 @@ static inline bool cpu_is_isolated(int cpu)
cpuset_cpu_is_isolated(cpu);
}
+DEFINE_LOCK_GUARD_0(housekeeping, housekeeping_lock(), housekeeping_unlock())
+
#endif /* _LINUX_SCHED_ISOLATION_H */
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 83cec3853864..8c02eeccea3b 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -18,12 +18,24 @@ static cpumask_var_t housekeeping_cpumasks[HK_TYPE_MAX];
unsigned long housekeeping_flags;
EXPORT_SYMBOL_GPL(housekeeping_flags);
+DEFINE_STATIC_PERCPU_RWSEM(housekeeping_pcpu_lock);
+
bool housekeeping_enabled(enum hk_type type)
{
return !!(housekeeping_flags & BIT(type));
}
EXPORT_SYMBOL_GPL(housekeeping_enabled);
+void housekeeping_lock(void)
+{
+ percpu_down_read(&housekeeping_pcpu_lock);
+}
+
+void housekeeping_unlock(void)
+{
+ percpu_up_read(&housekeeping_pcpu_lock);
+}
+
int housekeeping_any_cpu(enum hk_type type)
{
int cpu;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 475bb5998295..0cdb560ef2f3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -46,6 +46,7 @@
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/mutex_api.h>
+#include <linux/percpu-rwsem.h>
#include <linux/plist.h>
#include <linux/poll.h>
#include <linux/proc_fs.h>
--
2.48.1
* [PATCH 03/27] PCI: Protect against concurrent change of housekeeping cpumask
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 01/27] sched/isolation: Remove housekeeping static key Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 02/27] sched/isolation: Introduce housekeeping per-cpu rwsem Frederic Weisbecker
@ 2025-06-20 15:22 ` Frederic Weisbecker
2025-06-20 16:17 ` Bjorn Helgaas
2025-06-20 15:22 ` [PATCH 04/27] cpu: Protect against concurrent isolated cpuset change Frederic Weisbecker
` (24 subsequent siblings)
27 siblings, 1 reply; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Bjorn Helgaas, Marco Crivellari,
Michal Hocko, Peter Zijlstra, Tejun Heo, Thomas Gleixner,
Vlastimil Babka, Waiman Long
HK_TYPE_DOMAIN will soon integrate cpuset isolated partitions and
therefore be made modifiable at runtime. Synchronize against the cpumask
update using the appropriate locking.
Queue and wait for the PCI call to complete while holding the
housekeeping rwsem. This way the housekeeping update side doesn't need
to propagate its changes to PCI.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
drivers/pci/pci-driver.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 67db34fd10ee..459d211a408b 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -362,7 +362,7 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
dev->is_probed = 1;
cpu_hotplug_disable();
-
+ housekeeping_lock();
/*
* Prevent nesting work_on_cpu() for the case where a Virtual Function
* device is probed from work_on_cpu() of the Physical device.
@@ -392,6 +392,7 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
error = local_pci_probe(&ddi);
out:
dev->is_probed = 0;
+ housekeeping_unlock();
cpu_hotplug_enable();
return error;
}
--
2.48.1
* [PATCH 04/27] cpu: Protect against concurrent isolated cpuset change
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (2 preceding siblings ...)
2025-06-20 15:22 ` [PATCH 03/27] PCI: Protect against concurrent change of housekeeping cpumask Frederic Weisbecker
@ 2025-06-20 15:22 ` Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 05/27] memcg: Prepare to protect " Frederic Weisbecker
` (23 subsequent siblings)
27 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Ingo Molnar, Marco Crivellari, Michal Hocko,
Peter Zijlstra, Tejun Heo, Thomas Gleixner, Vlastimil Babka,
Waiman Long
_cpu_down() is called through work_on_cpu() on a target CPU contained
within the HK_TYPE_DOMAIN cpumask.

But that cpumask will soon also integrate the cpuset isolated
partitions and some synchronization is needed to make sure that
the work_on_cpu() doesn't execute or linger on an isolated CPU.

Unfortunately housekeeping_lock() can't be held before the call to
work_on_cpu() because _cpu_down() afterwards holds cpu_hotplug_lock.
This would be a lock inversion:
    cpu_down()                                   cpuset
    ----------                                   ------
    percpu_down_read(&housekeeping_pcpu_lock);   percpu_down_read(&cpu_hotplug_lock);
    percpu_down_write(&cpu_hotplug_lock);        percpu_down_write(&housekeeping_pcpu_lock);
To solve this situation, write-lock the cpu_hotplug_lock around the call
to work_on_cpu(). This prevents cpuset from modifying the housekeeping
cpumask and therefore synchronizes against HK_TYPE_DOMAIN cpumask
changes.
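
The resulting ordering on the hotplug side then looks like this (sketch
of what the patch implements):

	cpus_write_lock();	/* also blocks concurrent housekeeping updates */
	err = work_on_cpu(hk_cpu, __cpu_down_locked_work, &work);
	cpus_write_unlock();
	arch_smt_update();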
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/cpu.c | 44 ++++++++++++++++++++++++++++++--------------
1 file changed, 30 insertions(+), 14 deletions(-)
diff --git a/kernel/cpu.c b/kernel/cpu.c
index a59e009e0be4..069fce6c7eae 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1398,8 +1398,8 @@ static int cpuhp_down_callbacks(unsigned int cpu, struct cpuhp_cpu_state *st,
}
/* Requires cpu_add_remove_lock to be held */
-static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
- enum cpuhp_state target)
+static int __ref _cpu_down_locked(unsigned int cpu, int tasks_frozen,
+ enum cpuhp_state target)
{
struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
int prev_state, ret = 0;
@@ -1410,8 +1410,6 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
if (!cpu_present(cpu))
return -EINVAL;
- cpus_write_lock();
-
cpuhp_tasks_frozen = tasks_frozen;
prev_state = cpuhp_set_state(cpu, st, target);
@@ -1427,14 +1425,14 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
* return the error code..
*/
if (ret)
- goto out;
+ return ret;
/*
* We might have stopped still in the range of the AP hotplug
* thread. Nothing to do anymore.
*/
if (st->state > CPUHP_TEARDOWN_CPU)
- goto out;
+ return ret;
st->target = target;
}
@@ -1452,9 +1450,6 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
}
}
-out:
- cpus_write_unlock();
- arch_smt_update();
return ret;
}
@@ -1463,16 +1458,17 @@ struct cpu_down_work {
enum cpuhp_state target;
};
-static long __cpu_down_maps_locked(void *arg)
+static long __cpu_down_locked_work(void *arg)
{
struct cpu_down_work *work = arg;
- return _cpu_down(work->cpu, 0, work->target);
+ return _cpu_down_locked(work->cpu, 0, work->target);
}
static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
{
struct cpu_down_work work = { .cpu = cpu, .target = target, };
+ int err;
/*
* If the platform does not support hotplug, report it explicitly to
@@ -1483,17 +1479,24 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
if (cpu_hotplug_disabled)
return -EBUSY;
+ err = -EBUSY;
+
/*
* Ensure that the control task does not run on the to be offlined
* CPU to prevent a deadlock against cfs_b->period_timer.
* Also keep at least one housekeeping cpu onlined to avoid generating
- * an empty sched_domain span.
+ * an empty sched_domain span. Hotplug must be locked already to prevent
+ * cpusets from concurrently changing the housekeeping mask.
*/
+ cpus_write_lock();
for_each_cpu_and(cpu, cpu_online_mask, housekeeping_cpumask(HK_TYPE_DOMAIN)) {
if (cpu != work.cpu)
- return work_on_cpu(cpu, __cpu_down_maps_locked, &work);
+ err = work_on_cpu(cpu, __cpu_down_locked_work, &work);
}
- return -EBUSY;
+ cpus_write_unlock();
+ arch_smt_update();
+
+ return err;
}
static int cpu_down(unsigned int cpu, enum cpuhp_state target)
@@ -1896,6 +1899,19 @@ void __init bringup_nonboot_cpus(unsigned int max_cpus)
#ifdef CONFIG_PM_SLEEP_SMP
static cpumask_var_t frozen_cpus;
+static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
+ enum cpuhp_state target)
+{
+ int err;
+
+ cpus_write_lock();
+ err = _cpu_down_locked(cpu, tasks_frozen, target);
+ cpus_write_unlock();
+ arch_smt_update();
+
+ return err;
+}
+
int freeze_secondary_cpus(int primary)
{
int cpu, error = 0;
--
2.48.1
* [PATCH 05/27] memcg: Prepare to protect against concurrent isolated cpuset change
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (3 preceding siblings ...)
2025-06-20 15:22 ` [PATCH 04/27] cpu: Protect against concurrent isolated cpuset change Frederic Weisbecker
@ 2025-06-20 15:22 ` Frederic Weisbecker
2025-06-20 19:19 ` Shakeel Butt
2025-06-20 15:22 ` [PATCH 06/27] mm: vmstat: " Frederic Weisbecker
` (22 subsequent siblings)
27 siblings, 1 reply; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Andrew Morton, Johannes Weiner,
Marco Crivellari, Michal Hocko, Michal Hocko, Muchun Song,
Peter Zijlstra, Roman Gushchin, Shakeel Butt, Tejun Heo,
Thomas Gleixner, Vlastimil Babka, Waiman Long
The HK_TYPE_DOMAIN housekeeping cpumask will soon be made modifiable at
runtime. In order to synchronize against the memcg workqueue and make
sure that no asynchronous draining is pending or executing on a newly
made isolated CPU, read-lock the housekeeping rwsem while targeting
and queueing a drain work.

Whenever housekeeping updates the HK_TYPE_DOMAIN cpumask, a memcg
workqueue flush will also be issued in a further change to make sure
that no work remains pending after a CPU has been made isolated.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
mm/memcontrol.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 902da8a9c643..29d44af6c426 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1975,6 +1975,14 @@ static bool is_memcg_drain_needed(struct memcg_stock_pcp *stock,
return flush;
}
+static void schedule_drain_work(int cpu, struct work_struct *work)
+{
+ housekeeping_lock();
+ if (!cpu_is_isolated(cpu))
+ schedule_work_on(cpu, work);
+ housekeeping_unlock();
+}
+
/*
* Drains all per-CPU charge caches for given root_memcg resp. subtree
* of the hierarchy under it.
@@ -2004,8 +2012,8 @@ void drain_all_stock(struct mem_cgroup *root_memcg)
&memcg_st->flags)) {
if (cpu == curcpu)
drain_local_memcg_stock(&memcg_st->work);
- else if (!cpu_is_isolated(cpu))
- schedule_work_on(cpu, &memcg_st->work);
+ else
+ schedule_drain_work(cpu, &memcg_st->work);
}
if (!test_bit(FLUSHING_CACHED_CHARGE, &obj_st->flags) &&
@@ -2014,8 +2022,8 @@ void drain_all_stock(struct mem_cgroup *root_memcg)
&obj_st->flags)) {
if (cpu == curcpu)
drain_local_obj_stock(&obj_st->work);
- else if (!cpu_is_isolated(cpu))
- schedule_work_on(cpu, &obj_st->work);
+ else
+ schedule_drain_work(cpu, &obj_st->work);
}
}
migrate_enable();
--
2.48.1
* [PATCH 06/27] mm: vmstat: Prepare to protect against concurrent isolated cpuset change
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (4 preceding siblings ...)
2025-06-20 15:22 ` [PATCH 05/27] memcg: Prepare to protect " Frederic Weisbecker
@ 2025-06-20 15:22 ` Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 07/27] sched/isolation: Save boot defined domain flags Frederic Weisbecker
` (21 subsequent siblings)
27 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Andrew Morton, Marco Crivellari,
Michal Hocko, Peter Zijlstra, Tejun Heo, Thomas Gleixner,
Vlastimil Babka, Waiman Long, linux-mm
The HK_TYPE_DOMAIN housekeeping cpumask will soon be made modifiable at
runtime. In order to synchronize against the vmstat workqueue and make
sure that no asynchronous vmstat work is pending or executing on a newly
made isolated CPU, read-lock the housekeeping rwsem while targeting
and queueing a vmstat work.

Whenever housekeeping updates the HK_TYPE_DOMAIN cpumask, a vmstat
workqueue flush will also be issued in a further change to make sure
that no work remains pending after a CPU has been made isolated.
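
This uses the scope-based guard defined via DEFINE_LOCK_GUARD_0() in an
earlier patch, i.e. (sketch):

	scoped_guard(housekeeping) {
		/* housekeeping_lock() is held across this block and
		 * automatically released when leaving it */
	}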
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
mm/vmstat.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 429ae5339bfe..53123675fe31 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -2115,11 +2115,13 @@ static void vmstat_shepherd(struct work_struct *w)
* infrastructure ever noticing. Skip regular flushing from vmstat_shepherd
* for all isolated CPUs to avoid interference with the isolated workload.
*/
- if (cpu_is_isolated(cpu))
- continue;
+ scoped_guard(housekeeping) {
+ if (cpu_is_isolated(cpu))
+ continue;
- if (!delayed_work_pending(dw) && need_update(cpu))
- queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
+ if (!delayed_work_pending(dw) && need_update(cpu))
+ queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
+ }
cond_resched();
}
--
2.48.1
* [PATCH 07/27] sched/isolation: Save boot defined domain flags
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (5 preceding siblings ...)
2025-06-20 15:22 ` [PATCH 06/27] mm: vmstat: " Frederic Weisbecker
@ 2025-06-20 15:22 ` Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 08/27] cpuset: Convert boot_hk_cpus to use HK_TYPE_DOMAIN_BOOT Frederic Weisbecker
` (20 subsequent siblings)
27 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Ingo Molnar, Marco Crivellari, Michal Hocko,
Peter Zijlstra, Tejun Heo, Thomas Gleixner, Vlastimil Babka,
Waiman Long
HK_TYPE_DOMAIN will soon integrate not only the boot defined isolcpus=
CPUs but also cpuset isolated partitions.

Housekeeping still needs a way to record what was initially passed
to isolcpus= in order to keep these CPUs isolated after a cpuset
isolated partition containing some of them is modified or destroyed.

Create a new HK_TYPE_DOMAIN_BOOT to keep track of those.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/sched/isolation.h | 1 +
kernel/sched/isolation.c | 5 +++--
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index 8de4f625a5c1..731506d312d2 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -7,6 +7,7 @@
#include <linux/tick.h>
enum hk_type {
+ HK_TYPE_DOMAIN_BOOT,
HK_TYPE_DOMAIN,
HK_TYPE_MANAGED_IRQ,
HK_TYPE_KERNEL_NOISE,
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 8c02eeccea3b..9ecf53c5328b 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -9,6 +9,7 @@
*/
enum hk_flags {
+ HK_FLAG_DOMAIN_BOOT = BIT(HK_TYPE_DOMAIN_BOOT),
HK_FLAG_DOMAIN = BIT(HK_TYPE_DOMAIN),
HK_FLAG_MANAGED_IRQ = BIT(HK_TYPE_MANAGED_IRQ),
HK_FLAG_KERNEL_NOISE = BIT(HK_TYPE_KERNEL_NOISE),
@@ -214,7 +215,7 @@ static int __init housekeeping_isolcpus_setup(char *str)
if (!strncmp(str, "domain,", 7)) {
str += 7;
- flags |= HK_FLAG_DOMAIN;
+ flags |= HK_FLAG_DOMAIN | HK_FLAG_DOMAIN_BOOT;
continue;
}
@@ -244,7 +245,7 @@ static int __init housekeeping_isolcpus_setup(char *str)
/* Default behaviour for isolcpus without flags */
if (!flags)
- flags |= HK_FLAG_DOMAIN;
+ flags |= HK_FLAG_DOMAIN | HK_FLAG_DOMAIN_BOOT;
return housekeeping_setup(str, flags);
}
--
2.48.1
* [PATCH 08/27] cpuset: Convert boot_hk_cpus to use HK_TYPE_DOMAIN_BOOT
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (6 preceding siblings ...)
2025-06-20 15:22 ` [PATCH 07/27] sched/isolation: Save boot defined domain flags Frederic Weisbecker
@ 2025-06-20 15:22 ` Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 09/27] driver core: cpu: Convert /sys/devices/system/cpu/isolated " Frederic Weisbecker
` (19 subsequent siblings)
27 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Johannes Weiner, Marco Crivellari,
Michal Hocko, Michal Koutny, Peter Zijlstra, Tejun Heo,
Thomas Gleixner, Vlastimil Babka, Waiman Long, cgroups
boot_hk_cpus is an ad-hoc copy of HK_TYPE_DOMAIN_BOOT. Remove it and use
the official version.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/cgroup/cpuset.c | 22 +++++++---------------
1 file changed, 7 insertions(+), 15 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 3bc4301466f3..aae8a739d48d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -80,12 +80,6 @@ static cpumask_var_t subpartitions_cpus;
*/
static cpumask_var_t isolated_cpus;
-/*
- * Housekeeping (HK_TYPE_DOMAIN) CPUs at boot
- */
-static cpumask_var_t boot_hk_cpus;
-static bool have_boot_isolcpus;
-
/* List of remote partition root children */
static struct list_head remote_children;
@@ -1601,15 +1595,16 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
* @new_cpus: cpu mask
* Return: true if there is conflict, false otherwise
*
- * CPUs outside of boot_hk_cpus, if defined, can only be used in an
+ * CPUs outside of HK_TYPE_DOMAIN_BOOT, if defined, can only be used in an
* isolated partition.
*/
static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
{
- if (!have_boot_isolcpus)
+ if (!housekeeping_enabled(HK_TYPE_DOMAIN_BOOT))
return false;
- if ((prstate != PRS_ISOLATED) && !cpumask_subset(new_cpus, boot_hk_cpus))
+ if ((prstate != PRS_ISOLATED) &&
+ !cpumask_subset(new_cpus, housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT)))
return true;
return false;
@@ -3766,12 +3761,9 @@ int __init cpuset_init(void)
BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL));
- have_boot_isolcpus = housekeeping_enabled(HK_TYPE_DOMAIN);
- if (have_boot_isolcpus) {
- BUG_ON(!alloc_cpumask_var(&boot_hk_cpus, GFP_KERNEL));
- cpumask_copy(boot_hk_cpus, housekeeping_cpumask(HK_TYPE_DOMAIN));
- cpumask_andnot(isolated_cpus, cpu_possible_mask, boot_hk_cpus);
- }
+ if (housekeeping_enabled(HK_TYPE_DOMAIN_BOOT))
+ cpumask_andnot(isolated_cpus, cpu_possible_mask,
+ housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT));
return 0;
}
--
2.48.1
* [PATCH 09/27] driver core: cpu: Convert /sys/devices/system/cpu/isolated to use HK_TYPE_DOMAIN_BOOT
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (7 preceding siblings ...)
2025-06-20 15:22 ` [PATCH 08/27] cpuset: Convert boot_hk_cpus to use HK_TYPE_DOMAIN_BOOT Frederic Weisbecker
@ 2025-06-20 15:22 ` Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 10/27] net: Keep ignoring isolated cpuset change Frederic Weisbecker
` (18 subsequent siblings)
27 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Danilo Krummrich, Greg Kroah-Hartman,
Marco Crivellari, Michal Hocko, Peter Zijlstra,
Rafael J . Wysocki, Tejun Heo, Thomas Gleixner, Vlastimil Babka,
Waiman Long
Make sure /sys/devices/system/cpu/isolated only prints what was passed
through the isolcpus= parameter, before HK_TYPE_DOMAIN also integrates
cpuset isolated partitions.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
drivers/base/cpu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 7779ab0ca7ce..e1663021fe24 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -291,7 +291,7 @@ static ssize_t print_cpus_isolated(struct device *dev,
return -ENOMEM;
cpumask_andnot(isolated, cpu_possible_mask,
- housekeeping_cpumask(HK_TYPE_DOMAIN));
+ housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT));
len = sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(isolated));
free_cpumask_var(isolated);
--
2.48.1
* [PATCH 10/27] net: Keep ignoring isolated cpuset change
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (8 preceding siblings ...)
2025-06-20 15:22 ` [PATCH 09/27] driver core: cpu: Convert /sys/devices/system/cpu/isolated " Frederic Weisbecker
@ 2025-06-20 15:22 ` Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 11/27] block: Protect against concurrent " Frederic Weisbecker
` (17 subsequent siblings)
27 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, David S . Miller, Eric Dumazet,
Jakub Kicinski, Marco Crivellari, Michal Hocko, Paolo Abeni,
Peter Zijlstra, Simon Horman, Tejun Heo, Thomas Gleixner,
Vlastimil Babka, Waiman Long, netdev
The RPS cpumask can be overridden through sysfs/sysctl. The boot defined
isolated CPUs are then excluded from that cpumask.

However HK_TYPE_DOMAIN will soon integrate cpuset isolated CPU
updates and the RPS infrastructure needs more thought to be able
to propagate such changes and synchronize against them.

Keep handling only what was passed through "isolcpus=" for now.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
net/core/net-sysfs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 1ace0cd01adc..abff68ac34ec 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1012,7 +1012,7 @@ static int netdev_rx_queue_set_rps_mask(struct netdev_rx_queue *queue,
int rps_cpumask_housekeeping(struct cpumask *mask)
{
if (!cpumask_empty(mask)) {
- cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_DOMAIN));
+ cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT));
cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_WQ));
if (cpumask_empty(mask))
return -EINVAL;
--
2.48.1
* [PATCH 11/27] block: Protect against concurrent isolated cpuset change
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (9 preceding siblings ...)
2025-06-20 15:22 ` [PATCH 10/27] net: Keep ignoring isolated cpuset change Frederic Weisbecker
@ 2025-06-20 15:22 ` Frederic Weisbecker
2025-06-20 15:59 ` Bart Van Assche
2025-06-23 5:46 ` Christoph Hellwig
2025-06-20 15:22 ` [PATCH 12/27] cpu: Provide lockdep check for CPU hotplug lock write-held Frederic Weisbecker
` (16 subsequent siblings)
27 siblings, 2 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Jens Axboe, Marco Crivellari, Michal Hocko,
Peter Zijlstra, Tejun Heo, Thomas Gleixner, Vlastimil Babka,
Waiman Long, linux-block
The block subsystem prevents its workqueue from running on isolated
CPUs, including those defined by cpuset isolated partitions. Since
HK_TYPE_DOMAIN will soon contain both sets and be subject to runtime
modifications, synchronize against housekeeping using the relevant lock.

For full support of cpuset changes, the block subsystem may need to
propagate changes to the isolated cpumask through the workqueue in the
future.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
block/blk-mq.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4806b867e37d..ece3369825fe 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4237,12 +4237,16 @@ static void blk_mq_map_swqueue(struct request_queue *q)
/*
* Rule out isolated CPUs from hctx->cpumask to avoid
- * running block kworker on isolated CPUs
+ * running block kworker on isolated CPUs.
+ * FIXME: cpuset should propagate further changes to isolated CPUs
+ * here.
*/
+ housekeeping_lock();
for_each_cpu(cpu, hctx->cpumask) {
if (cpu_is_isolated(cpu))
cpumask_clear_cpu(cpu, hctx->cpumask);
}
+ housekeeping_unlock();
/*
* Initialize batch roundrobin counts
--
2.48.1
* [PATCH 12/27] cpu: Provide lockdep check for CPU hotplug lock write-held
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (10 preceding siblings ...)
2025-06-20 15:22 ` [PATCH 11/27] block: Protect against concurrent " Frederic Weisbecker
@ 2025-06-20 15:22 ` Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 13/27] cpuset: Provide lockdep check for cpuset lock held Frederic Weisbecker
` (15 subsequent siblings)
27 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Christoph Lameter, Dennis Zhou,
Marco Crivellari, Michal Hocko, Peter Zijlstra, Tejun Heo,
Thomas Gleixner, Vlastimil Babka, Waiman Long, linux-mm
cpuset modifies partitions, including isolated ones, while read-holding
the CPU hotplug lock.

This means that write-holding the CPU hotplug lock is safe to
synchronize against housekeeping cpumask changes.

Provide a lockdep check to validate that.
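
For example (sketch, matching a later patch in the series), a
housekeeping cpumask dereference check can then do:

	if (IS_ENABLED(CONFIG_HOTPLUG_CPU) && lockdep_is_cpus_write_held())
		return true;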
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/cpuhplock.h | 1 +
include/linux/percpu-rwsem.h | 1 +
kernel/cpu.c | 5 +++++
3 files changed, 7 insertions(+)
diff --git a/include/linux/cpuhplock.h b/include/linux/cpuhplock.h
index f7aa20f62b87..286b3ab92e15 100644
--- a/include/linux/cpuhplock.h
+++ b/include/linux/cpuhplock.h
@@ -13,6 +13,7 @@
struct device;
extern int lockdep_is_cpus_held(void);
+extern int lockdep_is_cpus_write_held(void);
#ifdef CONFIG_HOTPLUG_CPU
void cpus_write_lock(void);
diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 288f5235649a..c8cb010d655e 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -161,6 +161,7 @@ extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
__percpu_init_rwsem(sem, #sem, &rwsem_key); \
})
+#define percpu_rwsem_is_write_held(sem) lockdep_is_held_type(sem, 0)
#define percpu_rwsem_is_held(sem) lockdep_is_held(sem)
#define percpu_rwsem_assert_held(sem) lockdep_assert_held(sem)
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 069fce6c7eae..ccf11a17c7fd 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -533,6 +533,11 @@ int lockdep_is_cpus_held(void)
{
return percpu_rwsem_is_held(&cpu_hotplug_lock);
}
+
+int lockdep_is_cpus_write_held(void)
+{
+ return percpu_rwsem_is_write_held(&cpu_hotplug_lock);
+}
#endif
static void lockdep_acquire_cpus_lock(void)
--
2.48.1
* [PATCH 13/27] cpuset: Provide lockdep check for cpuset lock held
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (11 preceding siblings ...)
2025-06-20 15:22 ` [PATCH 12/27] cpu: Provide lockdep check for CPU hotplug lock write-held Frederic Weisbecker
@ 2025-06-20 15:22 ` Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 14/27] sched/isolation: Convert housekeeping cpumasks to rcu pointers Frederic Weisbecker
` (14 subsequent siblings)
27 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Johannes Weiner,
Marco Crivellari, Michal Hocko, Peter Zijlstra, Tejun Heo,
Thomas Gleixner, Vlastimil Babka, Waiman Long, cgroups
cpuset modifies partitions, including isolated ones, while holding the
cpuset mutex.

This means that holding the cpuset mutex is safe to synchronize against
housekeeping cpumask changes.

Provide a lockdep check to validate that.
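
For example (sketch, matching a later patch in the series):

	if (IS_ENABLED(CONFIG_CPUSETS) && lockdep_is_cpuset_held())
		return true;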
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/cpuset.h | 2 ++
kernel/cgroup/cpuset.c | 7 +++++++
2 files changed, 9 insertions(+)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 2ddb256187b5..051d36fec578 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -18,6 +18,8 @@
#include <linux/mmu_context.h>
#include <linux/jump_label.h>
+extern bool lockdep_is_cpuset_held(void);
+
#ifdef CONFIG_CPUSETS
/*
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index aae8a739d48d..8221b6a7da46 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -254,6 +254,13 @@ void cpuset_unlock(void)
mutex_unlock(&cpuset_mutex);
}
+#ifdef CONFIG_LOCKDEP
+bool lockdep_is_cpuset_held(void)
+{
+ return lockdep_is_held(&cpuset_mutex);
+}
+#endif
+
static DEFINE_SPINLOCK(callback_lock);
void cpuset_callback_lock_irq(void)
--
2.48.1
* [PATCH 14/27] sched/isolation: Convert housekeeping cpumasks to rcu pointers
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (12 preceding siblings ...)
2025-06-20 15:22 ` [PATCH 13/27] cpuset: Provide lockdep check for cpuset lock held Frederic Weisbecker
@ 2025-06-20 15:22 ` Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 15/27] cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset Frederic Weisbecker
` (13 subsequent siblings)
27 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Ingo Molnar, Marco Crivellari, Michal Hocko,
Peter Zijlstra, Tejun Heo, Thomas Gleixner, Vlastimil Babka,
Waiman Long
HK_TYPE_DOMAIN's cpumask will soon be made modifiable by cpuset.
Sleepable users of housekeeping can synchronize against cpumask
modifications using the housekeeping rwsem. Other callsites need an
alternative.

Turn the housekeeping cpumasks into RCU pointers. Once a housekeeping
cpumask is modified, the update side will wait for an RCU grace
period and propagate the change to the interested subsystems when
deemed necessary.
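
A non-sleepable reader can then rely on RCU instead of the rwsem
(sketch):

	rcu_read_lock();
	mask = housekeeping_cpumask(HK_TYPE_DOMAIN);
	cpu = cpumask_any_and(mask, cpu_online_mask);
	rcu_read_unlock();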
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/sched/isolation.c | 52 ++++++++++++++++++++++++++--------------
kernel/sched/sched.h | 1 +
2 files changed, 35 insertions(+), 18 deletions(-)
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 9ecf53c5328b..75505668dcb9 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -15,7 +15,7 @@ enum hk_flags {
HK_FLAG_KERNEL_NOISE = BIT(HK_TYPE_KERNEL_NOISE),
};
-static cpumask_var_t housekeeping_cpumasks[HK_TYPE_MAX];
+static struct cpumask __rcu *housekeeping_cpumasks[HK_TYPE_MAX];
unsigned long housekeeping_flags;
EXPORT_SYMBOL_GPL(housekeeping_flags);
@@ -37,16 +37,25 @@ void housekeeping_unlock(void)
percpu_up_read(&housekeeping_pcpu_lock);
}
+const struct cpumask *housekeeping_cpumask(enum hk_type type)
+{
+ if (housekeeping_flags & BIT(type)) {
+ return rcu_dereference_check(housekeeping_cpumasks[type], 1);
+ }
+ return cpu_possible_mask;
+}
+EXPORT_SYMBOL_GPL(housekeeping_cpumask);
+
int housekeeping_any_cpu(enum hk_type type)
{
int cpu;
if (housekeeping_flags & BIT(type)) {
- cpu = sched_numa_find_closest(housekeeping_cpumasks[type], smp_processor_id());
+ cpu = sched_numa_find_closest(housekeeping_cpumask(type), smp_processor_id());
if (cpu < nr_cpu_ids)
return cpu;
- cpu = cpumask_any_and_distribute(housekeeping_cpumasks[type], cpu_online_mask);
+ cpu = cpumask_any_and_distribute(housekeeping_cpumask(type), cpu_online_mask);
if (likely(cpu < nr_cpu_ids))
return cpu;
/*
@@ -62,25 +71,17 @@ int housekeeping_any_cpu(enum hk_type type)
}
EXPORT_SYMBOL_GPL(housekeeping_any_cpu);
-const struct cpumask *housekeeping_cpumask(enum hk_type type)
-{
- if (housekeeping_flags & BIT(type))
- return housekeeping_cpumasks[type];
- return cpu_possible_mask;
-}
-EXPORT_SYMBOL_GPL(housekeeping_cpumask);
-
void housekeeping_affine(struct task_struct *t, enum hk_type type)
{
if (housekeeping_flags & BIT(type))
- set_cpus_allowed_ptr(t, housekeeping_cpumasks[type]);
+ set_cpus_allowed_ptr(t, housekeeping_cpumask(type));
}
EXPORT_SYMBOL_GPL(housekeeping_affine);
bool housekeeping_test_cpu(int cpu, enum hk_type type)
{
if (housekeeping_flags & BIT(type))
- return cpumask_test_cpu(cpu, housekeeping_cpumasks[type]);
+ return cpumask_test_cpu(cpu, housekeeping_cpumask(type));
return true;
}
EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
@@ -95,9 +96,23 @@ void __init housekeeping_init(void)
if (housekeeping_flags & HK_FLAG_KERNEL_NOISE)
sched_tick_offload_init();
+ /*
+ * Realloc with a proper allocator so that any cpumask update
+ * can indifferently free the old version with kfree().
+ */
for_each_set_bit(type, &housekeeping_flags, HK_TYPE_MAX) {
+ struct cpumask *omask, *nmask = kmalloc(cpumask_size(), GFP_KERNEL);
+
+ if (WARN_ON_ONCE(!nmask))
+ return;
+
+ omask = rcu_dereference(housekeeping_cpumasks[type]);
+
/* We need at least one CPU to handle housekeeping work */
- WARN_ON_ONCE(cpumask_empty(housekeeping_cpumasks[type]));
+ WARN_ON_ONCE(cpumask_empty(omask));
+ cpumask_copy(nmask, omask);
+ RCU_INIT_POINTER(housekeeping_cpumasks[type], nmask);
+ memblock_free(omask, cpumask_size());
}
}
@@ -105,9 +120,10 @@ static void __init housekeeping_setup_type(enum hk_type type,
cpumask_var_t housekeeping_staging)
{
- alloc_bootmem_cpumask_var(&housekeeping_cpumasks[type]);
- cpumask_copy(housekeeping_cpumasks[type],
- housekeeping_staging);
+ struct cpumask *mask = memblock_alloc_or_panic(cpumask_size(), SMP_CACHE_BYTES);
+
+ cpumask_copy(mask, housekeeping_staging);
+ RCU_INIT_POINTER(housekeeping_cpumasks[type], mask);
}
static int __init housekeeping_setup(char *str, unsigned long flags)
@@ -160,7 +176,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
for_each_set_bit(type, &iter_flags, HK_TYPE_MAX) {
if (!cpumask_equal(housekeeping_staging,
- housekeeping_cpumasks[type])) {
+ housekeeping_cpumask(type))) {
pr_warn("Housekeeping: nohz_full= must match isolcpus=\n");
goto free_housekeeping_staging;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0cdb560ef2f3..407e7f5ad929 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -42,6 +42,7 @@
#include <linux/ktime_api.h>
#include <linux/lockdep_api.h>
#include <linux/lockdep.h>
+#include <linux/memblock.h>
#include <linux/minmax.h>
#include <linux/mm.h>
#include <linux/module.h>
--
2.48.1
* [PATCH 15/27] cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (13 preceding siblings ...)
2025-06-20 15:22 ` [PATCH 14/27] sched/isolation: Convert housekeeping cpumasks to rcu pointers Frederic Weisbecker
@ 2025-06-20 15:22 ` Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 16/27] sched/isolation: Flush memcg workqueues on cpuset isolated partition change Frederic Weisbecker
` (12 subsequent siblings)
27 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Ingo Molnar,
Johannes Weiner, Marco Crivellari, Michal Hocko, Peter Zijlstra,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long, cgroups
Until now, HK_TYPE_DOMAIN only included the boot defined isolated
CPUs passed through the isolcpus= option. Users interested in also
knowing the runtime defined isolated CPUs through cpuset must use
different APIs: cpuset_cpu_is_isolated(), cpu_is_isolated(), etc...

There are many drawbacks to that approach:

1) Most interested subsystems want to know about all isolated CPUs, not
   just those defined at boot time.

2) cpuset_cpu_is_isolated() / cpu_is_isolated() are not synchronized
   with concurrent cpuset changes.

3) Further cpuset modifications are not propagated to subsystems.

Solve 1) and 2) by centralizing all isolated CPUs within the
HK_TYPE_DOMAIN housekeeping cpumask under the housekeeping lock.
Subsystems can rely on the housekeeping lock or RCU to synchronize
against concurrent changes.

The propagation mentioned in 3) will be handled in further patches.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/sched/isolation.h | 5 ++-
kernel/cgroup/cpuset.c | 2 +
kernel/sched/isolation.c | 71 ++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 1 +
4 files changed, 72 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index 731506d312d2..f1b309f18511 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -36,7 +36,7 @@ extern bool housekeeping_test_cpu(int cpu, enum hk_type type);
static inline bool housekeeping_cpu(int cpu, enum hk_type type)
{
- if (housekeeping_flags & BIT(type))
+ if (READ_ONCE(housekeeping_flags) & BIT(type))
return housekeeping_test_cpu(cpu, type);
else
return true;
@@ -45,6 +45,8 @@ static inline bool housekeeping_cpu(int cpu, enum hk_type type)
extern void housekeeping_lock(void);
extern void housekeeping_unlock(void);
+extern int housekeeping_update(struct cpumask *mask, enum hk_type type);
+
extern void __init housekeeping_init(void);
#else
@@ -79,6 +81,7 @@ static inline bool housekeeping_cpu(int cpu, enum hk_type type)
static inline void housekeeping_lock(void) { }
static inline void housekeeping_unlock(void) { }
+static inline int housekeeping_update(struct cpumask *mask, enum hk_type type) { return 0; }
static inline void housekeeping_init(void) { }
#endif /* CONFIG_CPU_ISOLATION */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 8221b6a7da46..5f169a56f06c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1351,6 +1351,8 @@ static void update_unbound_workqueue_cpumask(bool isolcpus_updated)
ret = workqueue_unbound_exclude_cpumask(isolated_cpus);
WARN_ON_ONCE(ret < 0);
+ ret = housekeeping_update(isolated_cpus, HK_TYPE_DOMAIN);
+ WARN_ON_ONCE(ret < 0);
}
/**
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 75505668dcb9..7814d60be87e 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -23,7 +23,7 @@ DEFINE_STATIC_PERCPU_RWSEM(housekeeping_pcpu_lock);
bool housekeeping_enabled(enum hk_type type)
{
- return !!(housekeeping_flags & BIT(type));
+ return !!(READ_ONCE(housekeeping_flags) & BIT(type));
}
EXPORT_SYMBOL_GPL(housekeeping_enabled);
@@ -37,12 +37,39 @@ void housekeeping_unlock(void)
percpu_up_read(&housekeeping_pcpu_lock);
}
+static bool housekeeping_dereference_check(enum hk_type type)
+{
+ if (type == HK_TYPE_DOMAIN) {
+ if (system_state == SYSTEM_BOOTING)
+ return true;
+ if (IS_ENABLED(CONFIG_HOTPLUG_CPU) && lockdep_is_cpus_write_held())
+ return true;
+ if (percpu_rwsem_is_held(&housekeeping_pcpu_lock))
+ return true;
+ if (IS_ENABLED(CONFIG_CPUSETS) && lockdep_is_cpuset_held())
+ return true;
+
+ return false;
+ }
+
+ return true;
+}
+
+static inline struct cpumask *__housekeeping_cpumask(enum hk_type type)
+{
+ return rcu_dereference_check(housekeeping_cpumasks[type],
+ housekeeping_dereference_check(type));
+}
+
const struct cpumask *housekeeping_cpumask(enum hk_type type)
{
- if (housekeeping_flags & BIT(type)) {
- return rcu_dereference_check(housekeeping_cpumasks[type], 1);
- }
- return cpu_possible_mask;
+ const struct cpumask *mask = NULL;
+
+ if (READ_ONCE(housekeeping_flags) & BIT(type))
+ mask = __housekeeping_cpumask(type);
+ if (!mask)
+ mask = cpu_possible_mask;
+ return mask;
}
EXPORT_SYMBOL_GPL(housekeeping_cpumask);
@@ -80,12 +107,44 @@ EXPORT_SYMBOL_GPL(housekeeping_affine);
bool housekeeping_test_cpu(int cpu, enum hk_type type)
{
- if (housekeeping_flags & BIT(type))
+ if (READ_ONCE(housekeeping_flags) & BIT(type))
return cpumask_test_cpu(cpu, housekeeping_cpumask(type));
return true;
}
EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
+int housekeeping_update(struct cpumask *mask, enum hk_type type)
+{
+ struct cpumask *trial, *old = NULL;
+
+ if (type != HK_TYPE_DOMAIN)
+ return -ENOTSUPP;
+
+ trial = kmalloc(sizeof(*trial), GFP_KERNEL);
+ if (!trial)
+ return -ENOMEM;
+
+ cpumask_andnot(trial, housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT), mask);
+ if (!cpumask_intersects(trial, cpu_online_mask)) {
+ kfree(trial);
+ return -EINVAL;
+ }
+
+ percpu_down_write(&housekeeping_pcpu_lock);
+ if (housekeeping_flags & BIT(type))
+ old = __housekeeping_cpumask(type);
+ else
+ WRITE_ONCE(housekeeping_flags, housekeeping_flags | BIT(type));
+ rcu_assign_pointer(housekeeping_cpumasks[type], trial);
+ percpu_up_write(&housekeeping_pcpu_lock);
+
+ synchronize_rcu();
+
+ kfree(old);
+
+ return 0;
+}
+
void __init housekeeping_init(void)
{
enum hk_type type;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 407e7f5ad929..04094567cad4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -30,6 +30,7 @@
#include <linux/context_tracking.h>
#include <linux/cpufreq.h>
#include <linux/cpumask_api.h>
+#include <linux/cpuset.h>
#include <linux/ctype.h>
#include <linux/file.h>
#include <linux/fs_api.h>
--
2.48.1
* [PATCH 16/27] sched/isolation: Flush memcg workqueues on cpuset isolated partition change
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (14 preceding siblings ...)
2025-06-20 15:22 ` [PATCH 15/27] cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset Frederic Weisbecker
@ 2025-06-20 15:22 ` Frederic Weisbecker
2025-06-20 19:30 ` Shakeel Butt
2025-06-20 15:22 ` [PATCH 17/27] sched/isolation: Flush vmstat " Frederic Weisbecker
` (11 subsequent siblings)
27 siblings, 1 reply; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Andrew Morton, Ingo Molnar, Johannes Weiner,
Marco Crivellari, Michal Hocko, Michal Hocko, Muchun Song,
Peter Zijlstra, Roman Gushchin, Shakeel Butt, Tejun Heo,
Thomas Gleixner, Vlastimil Babka, Waiman Long, cgroups, linux-mm
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime. In
order to synchronize against the memcg workqueue and make sure that no
asynchronous draining is still pending or executing on a newly made
isolated CPU, the housekeeping subsystem must flush the memcg
workqueues.
However the memcg works can't be flushed easily since they are
queued to the main per-CPU workqueue pool.
Solve this by creating a memcg specific pool and by providing and using
the appropriate flushing API.
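The update side ordering is then expected to look roughly like this
(simplified sketch of the change below, with "new_mask" standing for the
updated housekeeping cpumask):

	/* housekeeping_update(), simplified */
	percpu_down_write(&housekeeping_pcpu_lock);
	rcu_assign_pointer(housekeeping_cpumasks[HK_TYPE_DOMAIN], new_mask);
	percpu_up_write(&housekeeping_pcpu_lock);
	synchronize_rcu();
	/* No reader can still observe the old mask beyond this point */
	mem_cgroup_flush_workqueue();
	/* No drain work can still be pending on a newly isolated CPU */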
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/memcontrol.h | 4 ++++
kernel/sched/isolation.c | 2 ++
kernel/sched/sched.h | 1 +
mm/memcontrol.c | 12 +++++++++++-
4 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 87b6688f124a..ef5036c6bf04 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1046,6 +1046,8 @@ static inline u64 cgroup_id_from_mm(struct mm_struct *mm)
return id;
}
+void mem_cgroup_flush_workqueue(void);
+
extern int mem_cgroup_init(void);
#else /* CONFIG_MEMCG */
@@ -1451,6 +1453,8 @@ static inline u64 cgroup_id_from_mm(struct mm_struct *mm)
return 0;
}
+static inline void mem_cgroup_flush_workqueue(void) { }
+
static inline int mem_cgroup_init(void) { return 0; }
#endif /* CONFIG_MEMCG */
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 7814d60be87e..6fb0c7956516 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -140,6 +140,8 @@ int housekeeping_update(struct cpumask *mask, enum hk_type type)
synchronize_rcu();
+ mem_cgroup_flush_workqueue();
+
kfree(old);
return 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 04094567cad4..53107c021fe9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -44,6 +44,7 @@
#include <linux/lockdep_api.h>
#include <linux/lockdep.h>
#include <linux/memblock.h>
+#include <linux/memcontrol.h>
#include <linux/minmax.h>
#include <linux/mm.h>
#include <linux/module.h>
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 29d44af6c426..928b90cdb5ba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -96,6 +96,8 @@ static bool cgroup_memory_nokmem __ro_after_init;
/* BPF memory accounting disabled? */
static bool cgroup_memory_nobpf __ro_after_init;
+static struct workqueue_struct *memcg_wq __ro_after_init;
+
static struct kmem_cache *memcg_cachep;
static struct kmem_cache *memcg_pn_cachep;
@@ -1979,7 +1981,7 @@ static void schedule_drain_work(int cpu, struct work_struct *work)
{
housekeeping_lock();
if (!cpu_is_isolated(cpu))
- schedule_work_on(cpu, work);
+ queue_work_on(cpu, memcg_wq, work);
housekeeping_unlock();
}
@@ -5140,6 +5142,11 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
refill_stock(memcg, nr_pages);
}
+void mem_cgroup_flush_workqueue(void)
+{
+ flush_workqueue(memcg_wq);
+}
+
static int __init cgroup_memory(char *s)
{
char *token;
@@ -5182,6 +5189,9 @@ int __init mem_cgroup_init(void)
cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD, "mm/memctrl:dead", NULL,
memcg_hotplug_cpu_dead);
+ memcg_wq = alloc_workqueue("memcg", 0, 0);
+ WARN_ON(!memcg_wq);
+
for_each_possible_cpu(cpu) {
INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
drain_local_memcg_stock);
--
2.48.1
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 17/27] sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (15 preceding siblings ...)
2025-06-20 15:22 ` [PATCH 16/27] sched/isolation: Flush memcg workqueues on cpuset isolated partition change Frederic Weisbecker
@ 2025-06-20 15:22 ` Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 18/27] cpuset: Propagate cpuset isolation update to workqueue through housekeeping Frederic Weisbecker
` (10 subsequent siblings)
27 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Andrew Morton, Ingo Molnar, Marco Crivellari,
Michal Hocko, Peter Zijlstra, Tejun Heo, Thomas Gleixner,
Vlastimil Babka, Waiman Long, linux-mm
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime.
In order to synchronize against the vmstat workqueue and make sure
that no asynchronous vmstat work is still pending or executing on a
newly made isolated CPU, the housekeeping subsystem must flush the
vmstat workqueues.
This involves flushing the whole mm_percpu_wq workqueue, which is also
shared with the LRU drain, introducing a welcome side effect along the
way.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/vmstat.h | 2 ++
kernel/sched/isolation.c | 1 +
kernel/sched/sched.h | 1 +
mm/vmstat.c | 5 +++++
4 files changed, 9 insertions(+)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index b2ccb6845595..ba7caacdf356 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -303,6 +303,7 @@ int calculate_pressure_threshold(struct zone *zone);
int calculate_normal_threshold(struct zone *zone);
void set_pgdat_percpu_threshold(pg_data_t *pgdat,
int (*calculate_pressure)(struct zone *));
+void vmstat_flush_workqueue(void);
#else /* CONFIG_SMP */
/*
@@ -403,6 +404,7 @@ static inline void __dec_node_page_state(struct page *page,
static inline void refresh_zone_stat_thresholds(void) { }
static inline void cpu_vm_stats_fold(int cpu) { }
static inline void quiet_vmstat(void) { }
+static inline void vmstat_flush_workqueue(void) { }
static inline void drain_zonestat(struct zone *zone,
struct per_cpu_zonestat *pzstats) { }
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 6fb0c7956516..0119685796be 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -141,6 +141,7 @@ int housekeeping_update(struct cpumask *mask, enum hk_type type)
synchronize_rcu();
mem_cgroup_flush_workqueue();
+ vmstat_flush_workqueue();
kfree(old);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 53107c021fe9..e2c4258cb818 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -69,6 +69,7 @@
#include <linux/types.h>
#include <linux/u64_stats_sync_api.h>
#include <linux/uaccess.h>
+#include <linux/vmstat.h>
#include <linux/wait_api.h>
#include <linux/wait_bit.h>
#include <linux/workqueue_api.h>
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 53123675fe31..5d462fe12548 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -2095,6 +2095,11 @@ static void vmstat_shepherd(struct work_struct *w);
static DECLARE_DEFERRABLE_WORK(shepherd, vmstat_shepherd);
+void vmstat_flush_workqueue(void)
+{
+ flush_workqueue(mm_percpu_wq);
+}
+
static void vmstat_shepherd(struct work_struct *w)
{
int cpu;
--
2.48.1
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 18/27] cpuset: Propagate cpuset isolation update to workqueue through housekeeping
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (16 preceding siblings ...)
2025-06-20 15:22 ` [PATCH 17/27] sched/isolation: Flush vmstat " Frederic Weisbecker
@ 2025-06-20 15:22 ` Frederic Weisbecker
2025-06-20 15:23 ` [PATCH 19/27] cpuset: Remove cpuset_cpu_is_isolated() Frederic Weisbecker
` (9 subsequent siblings)
27 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Ingo Molnar,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Peter Zijlstra, Tejun Heo, Thomas Gleixner, Vlastimil Babka,
Waiman Long, cgroups
Until now, cpuset would propagate isolated partition changes to
workqueues so that unbound workers get properly reaffined.
Since housekeeping now centralizes, synchronizes and propagates isolation
cpumask changes, perform the work from that subsystem for consolidation
and consistency purposes.
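After this change, a cpuset isolated partition update propagates roughly
along the following call chain (sketch of the resulting code paths):

	update_housekeeping_cpumask()
	    housekeeping_update(isolated_cpus, HK_TYPE_DOMAIN)
	        synchronize_rcu();
	        mem_cgroup_flush_workqueue();
	        vmstat_flush_workqueue();
	        workqueue_unbound_exclude_cpumask(housekeeping_cpumask(HK_TYPE_DOMAIN));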
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/workqueue.h | 2 +-
init/Kconfig | 1 +
kernel/cgroup/cpuset.c | 14 ++++++--------
kernel/sched/isolation.c | 4 +++-
kernel/workqueue.c | 2 +-
5 files changed, 12 insertions(+), 11 deletions(-)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 6e30f275da77..8a32c594bba1 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -581,7 +581,7 @@ struct workqueue_attrs *alloc_workqueue_attrs(void);
void free_workqueue_attrs(struct workqueue_attrs *attrs);
int apply_workqueue_attrs(struct workqueue_struct *wq,
const struct workqueue_attrs *attrs);
-extern int workqueue_unbound_exclude_cpumask(cpumask_var_t cpumask);
+extern int workqueue_unbound_exclude_cpumask(const struct cpumask *cpumask);
extern bool queue_work_on(int cpu, struct workqueue_struct *wq,
struct work_struct *work);
diff --git a/init/Kconfig b/init/Kconfig
index af4c2f085455..b7cbb6e01e8d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1205,6 +1205,7 @@ config CPUSETS
bool "Cpuset controller"
depends on SMP
select UNION_FIND
+ select CPU_ISOLATION
help
This option will let you create and manage CPUSETs which
allow dynamically partitioning a system into sets of CPUs and
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 5f169a56f06c..98b1ea0ad336 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1340,7 +1340,7 @@ static bool partition_xcpus_del(int old_prs, struct cpuset *parent,
return isolcpus_updated;
}
-static void update_unbound_workqueue_cpumask(bool isolcpus_updated)
+static void update_housekeeping_cpumask(bool isolcpus_updated)
{
int ret;
@@ -1349,8 +1349,6 @@ static void update_unbound_workqueue_cpumask(bool isolcpus_updated)
if (!isolcpus_updated)
return;
- ret = workqueue_unbound_exclude_cpumask(isolated_cpus);
- WARN_ON_ONCE(ret < 0);
ret = housekeeping_update(isolated_cpus, HK_TYPE_DOMAIN);
WARN_ON_ONCE(ret < 0);
}
@@ -1473,7 +1471,7 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
list_add(&cs->remote_sibling, &remote_children);
cpumask_copy(cs->effective_xcpus, tmp->new_cpus);
spin_unlock_irq(&callback_lock);
- update_unbound_workqueue_cpumask(isolcpus_updated);
+ update_housekeeping_cpumask(isolcpus_updated);
cpuset_force_rebuild();
cs->prs_err = 0;
@@ -1514,7 +1512,7 @@ static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
compute_effective_exclusive_cpumask(cs, NULL, NULL);
reset_partition_data(cs);
spin_unlock_irq(&callback_lock);
- update_unbound_workqueue_cpumask(isolcpus_updated);
+ update_housekeeping_cpumask(isolcpus_updated);
cpuset_force_rebuild();
/*
@@ -1583,7 +1581,7 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
if (xcpus)
cpumask_copy(cs->exclusive_cpus, xcpus);
spin_unlock_irq(&callback_lock);
- update_unbound_workqueue_cpumask(isolcpus_updated);
+ update_housekeeping_cpumask(isolcpus_updated);
if (adding || deleting)
cpuset_force_rebuild();
@@ -1947,7 +1945,7 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
WARN_ON_ONCE(parent->nr_subparts < 0);
}
spin_unlock_irq(&callback_lock);
- update_unbound_workqueue_cpumask(isolcpus_updated);
+ update_housekeeping_cpumask(isolcpus_updated);
if ((old_prs != new_prs) && (cmd == partcmd_update))
update_partition_exclusive_flag(cs, new_prs);
@@ -2972,7 +2970,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
else if (isolcpus_updated)
isolated_cpus_update(old_prs, new_prs, cs->effective_xcpus);
spin_unlock_irq(&callback_lock);
- update_unbound_workqueue_cpumask(isolcpus_updated);
+ update_housekeeping_cpumask(isolcpus_updated);
/* Force update if switching back to member & update effective_xcpus */
update_cpumasks_hier(cs, &tmpmask, !new_prs);
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 0119685796be..e4e4fcd4cb2c 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -116,6 +116,7 @@ EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
int housekeeping_update(struct cpumask *mask, enum hk_type type)
{
struct cpumask *trial, *old = NULL;
+ int err;
if (type != HK_TYPE_DOMAIN)
return -ENOTSUPP;
@@ -142,10 +143,11 @@ int housekeeping_update(struct cpumask *mask, enum hk_type type)
mem_cgroup_flush_workqueue();
vmstat_flush_workqueue();
+ err = workqueue_unbound_exclude_cpumask(housekeeping_cpumask(type));
kfree(old);
- return 0;
+ return err;
}
void __init housekeeping_init(void)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 97f37b5bae66..e55fcf980c5d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -6948,7 +6948,7 @@ static int workqueue_apply_unbound_cpumask(const cpumask_var_t unbound_cpumask)
* This function can be called from cpuset code to provide a set of isolated
* CPUs that should be excluded from wq_unbound_cpumask.
*/
-int workqueue_unbound_exclude_cpumask(cpumask_var_t exclude_cpumask)
+int workqueue_unbound_exclude_cpumask(const struct cpumask *exclude_cpumask)
{
cpumask_var_t cpumask;
int ret = 0;
--
2.48.1
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 19/27] cpuset: Remove cpuset_cpu_is_isolated()
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (17 preceding siblings ...)
2025-06-20 15:22 ` [PATCH 18/27] cpuset: Propagate cpuset isolation update to workqueue through housekeeping Frederic Weisbecker
@ 2025-06-20 15:23 ` Frederic Weisbecker
2025-06-20 15:23 ` [PATCH 20/27] sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated() Frederic Weisbecker
` (8 subsequent siblings)
27 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:23 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Johannes Weiner,
Marco Crivellari, Michal Hocko, Peter Zijlstra, Tejun Heo,
Thomas Gleixner, Vlastimil Babka, Waiman Long, cgroups
The set of cpuset isolated CPUs is now included in the HK_TYPE_DOMAIN
housekeeping cpumask. There is no use case left that is interested in
checking only what is isolated by cpuset and not by the isolcpus=
kernel boot parameter.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/cpuset.h | 6 ------
include/linux/sched/isolation.h | 3 +--
kernel/cgroup/cpuset.c | 12 ------------
3 files changed, 1 insertion(+), 20 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 051d36fec578..a10775a4f702 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -78,7 +78,6 @@ extern void cpuset_lock(void);
extern void cpuset_unlock(void);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
-extern bool cpuset_cpu_is_isolated(int cpu);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
#define cpuset_current_mems_allowed (current->mems_allowed)
void cpuset_init_current_mems_allowed(void);
@@ -208,11 +207,6 @@ static inline bool cpuset_cpus_allowed_fallback(struct task_struct *p)
return false;
}
-static inline bool cpuset_cpu_is_isolated(int cpu)
-{
- return false;
-}
-
static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
{
return node_possible_map;
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index f1b309f18511..9f039dfb5739 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -89,8 +89,7 @@ static inline void housekeeping_init(void) { }
static inline bool cpu_is_isolated(int cpu)
{
return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN) ||
- !housekeeping_test_cpu(cpu, HK_TYPE_TICK) ||
- cpuset_cpu_is_isolated(cpu);
+ !housekeeping_test_cpu(cpu, HK_TYPE_TICK);
}
DEFINE_LOCK_GUARD_0(housekeeping, housekeeping_lock(), housekeeping_unlock())
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 98b1ea0ad336..db80e72681ed 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -29,7 +29,6 @@
#include <linux/mempolicy.h>
#include <linux/mm.h>
#include <linux/memory.h>
-#include <linux/export.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/sched/deadline.h>
@@ -1353,17 +1352,6 @@ static void update_housekeeping_cpumask(bool isolcpus_updated)
WARN_ON_ONCE(ret < 0);
}
-/**
- * cpuset_cpu_is_isolated - Check if the given CPU is isolated
- * @cpu: the CPU number to be checked
- * Return: true if CPU is used in an isolated partition, false otherwise
- */
-bool cpuset_cpu_is_isolated(int cpu)
-{
- return cpumask_test_cpu(cpu, isolated_cpus);
-}
-EXPORT_SYMBOL_GPL(cpuset_cpu_is_isolated);
-
/*
* compute_effective_exclusive_cpumask - compute effective exclusive CPUs
* @cs: cpuset
--
2.48.1
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 20/27] sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated()
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (18 preceding siblings ...)
2025-06-20 15:23 ` [PATCH 19/27] cpuset: Remove cpuset_cpu_is_isolated() Frederic Weisbecker
@ 2025-06-20 15:23 ` Frederic Weisbecker
2025-06-20 15:23 ` [PATCH 21/27] kthread: Refine naming of affinity related fields Frederic Weisbecker
` (7 subsequent siblings)
27 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:23 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Ingo Molnar, Marco Crivellari, Michal Hocko,
Peter Zijlstra, Tejun Heo, Thomas Gleixner, Vlastimil Babka,
Waiman Long
It doesn't make sense to use nohz_full without also isolating the
related CPUs from the domain topology, either through the use of
isolcpus= or cpuset isolated partitions.
And now HK_TYPE_DOMAIN includes all kinds of domain isolated CPUs.
This means that HK_TYPE_KERNEL_NOISE (of which HK_TYPE_TICK is only an
alias) implies HK_TYPE_DOMAIN and therefore checking the latter is
enough to deduce the former.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/sched/isolation.h | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index 9f039dfb5739..46677e8edf76 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -88,8 +88,7 @@ static inline void housekeeping_init(void) { }
static inline bool cpu_is_isolated(int cpu)
{
- return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN) ||
- !housekeeping_test_cpu(cpu, HK_TYPE_TICK);
+ return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN);
}
DEFINE_LOCK_GUARD_0(housekeeping, housekeeping_lock(), housekeeping_unlock())
--
2.48.1
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 21/27] kthread: Refine naming of affinity related fields
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (19 preceding siblings ...)
2025-06-20 15:23 ` [PATCH 20/27] sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated() Frederic Weisbecker
@ 2025-06-20 15:23 ` Frederic Weisbecker
2025-06-20 15:23 ` [PATCH 22/27] kthread: Include unbound kthreads in the managed affinity list Frederic Weisbecker
` (6 subsequent siblings)
27 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:23 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Marco Crivellari, Michal Hocko,
Peter Zijlstra, Tejun Heo, Thomas Gleixner, Vlastimil Babka,
Waiman Long
The kthreads preferred affinity related fields use "hotplug" as the base
of their naming because the affinity management was initially only meant
to deal with CPU hotplug.
The scope of this role is now going to broaden and also deal with
cpuset isolated partition updates.
Switch the naming accordingly.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/kthread.c | 38 +++++++++++++++++++-------------------
1 file changed, 19 insertions(+), 19 deletions(-)
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 85fc068f0083..24008dd9f3dc 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -35,8 +35,8 @@ static DEFINE_SPINLOCK(kthread_create_lock);
static LIST_HEAD(kthread_create_list);
struct task_struct *kthreadd_task;
-static LIST_HEAD(kthreads_hotplug);
-static DEFINE_MUTEX(kthreads_hotplug_lock);
+static LIST_HEAD(kthread_affinity_list);
+static DEFINE_MUTEX(kthread_affinity_lock);
struct kthread_create_info
{
@@ -69,7 +69,7 @@ struct kthread {
/* To store the full name if task comm is truncated. */
char *full_name;
struct task_struct *task;
- struct list_head hotplug_node;
+ struct list_head affinity_node;
struct cpumask *preferred_affinity;
};
@@ -129,7 +129,7 @@ bool set_kthread_struct(struct task_struct *p)
init_completion(&kthread->exited);
init_completion(&kthread->parked);
- INIT_LIST_HEAD(&kthread->hotplug_node);
+ INIT_LIST_HEAD(&kthread->affinity_node);
p->vfork_done = &kthread->exited;
kthread->task = p;
@@ -324,10 +324,10 @@ void __noreturn kthread_exit(long result)
{
struct kthread *kthread = to_kthread(current);
kthread->result = result;
- if (!list_empty(&kthread->hotplug_node)) {
- mutex_lock(&kthreads_hotplug_lock);
- list_del(&kthread->hotplug_node);
- mutex_unlock(&kthreads_hotplug_lock);
+ if (!list_empty(&kthread->affinity_node)) {
+ mutex_lock(&kthread_affinity_lock);
+ list_del(&kthread->affinity_node);
+ mutex_unlock(&kthread_affinity_lock);
if (kthread->preferred_affinity) {
kfree(kthread->preferred_affinity);
@@ -391,9 +391,9 @@ static void kthread_affine_node(void)
return;
}
- mutex_lock(&kthreads_hotplug_lock);
- WARN_ON_ONCE(!list_empty(&kthread->hotplug_node));
- list_add_tail(&kthread->hotplug_node, &kthreads_hotplug);
+ mutex_lock(&kthread_affinity_lock);
+ WARN_ON_ONCE(!list_empty(&kthread->affinity_node));
+ list_add_tail(&kthread->affinity_node, &kthread_affinity_list);
/*
* The node cpumask is racy when read from kthread() but:
* - a racing CPU going down will either fail on the subsequent
@@ -403,7 +403,7 @@ static void kthread_affine_node(void)
*/
kthread_fetch_affinity(kthread, affinity);
set_cpus_allowed_ptr(current, affinity);
- mutex_unlock(&kthreads_hotplug_lock);
+ mutex_unlock(&kthread_affinity_lock);
free_cpumask_var(affinity);
}
@@ -877,10 +877,10 @@ int kthread_affine_preferred(struct task_struct *p, const struct cpumask *mask)
goto out;
}
- mutex_lock(&kthreads_hotplug_lock);
+ mutex_lock(&kthread_affinity_lock);
cpumask_copy(kthread->preferred_affinity, mask);
- WARN_ON_ONCE(!list_empty(&kthread->hotplug_node));
- list_add_tail(&kthread->hotplug_node, &kthreads_hotplug);
+ WARN_ON_ONCE(!list_empty(&kthread->affinity_node));
+ list_add_tail(&kthread->affinity_node, &kthread_affinity_list);
kthread_fetch_affinity(kthread, affinity);
/* It's safe because the task is inactive. */
@@ -888,7 +888,7 @@ int kthread_affine_preferred(struct task_struct *p, const struct cpumask *mask)
do_set_cpus_allowed(p, affinity);
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
- mutex_unlock(&kthreads_hotplug_lock);
+ mutex_unlock(&kthread_affinity_lock);
out:
free_cpumask_var(affinity);
@@ -908,9 +908,9 @@ static int kthreads_online_cpu(unsigned int cpu)
struct kthread *k;
int ret;
- guard(mutex)(&kthreads_hotplug_lock);
+ guard(mutex)(&kthread_affinity_lock);
- if (list_empty(&kthreads_hotplug))
+ if (list_empty(&kthread_affinity_list))
return 0;
if (!zalloc_cpumask_var(&affinity, GFP_KERNEL))
@@ -918,7 +918,7 @@ static int kthreads_online_cpu(unsigned int cpu)
ret = 0;
- list_for_each_entry(k, &kthreads_hotplug, hotplug_node) {
+ list_for_each_entry(k, &kthread_affinity_list, affinity_node) {
if (WARN_ON_ONCE((k->task->flags & PF_NO_SETAFFINITY) ||
kthread_is_per_cpu(k->task))) {
ret = -EINVAL;
--
2.48.1
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 22/27] kthread: Include unbound kthreads in the managed affinity list
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (20 preceding siblings ...)
2025-06-20 15:23 ` [PATCH 21/27] kthread: Refine naming of affinity related fields Frederic Weisbecker
@ 2025-06-20 15:23 ` Frederic Weisbecker
2025-06-20 15:23 ` [PATCH 23/27] kthread: Include kthreadd to " Frederic Weisbecker
` (5 subsequent siblings)
27 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:23 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Marco Crivellari, Michal Hocko,
Peter Zijlstra, Tejun Heo, Thomas Gleixner, Vlastimil Babka,
Waiman Long
The managed affinity list currently contains only unbound kthreads that
have affinity preferences. Unbound kthreads that are globally affine by
default are outside of the list because their affinity is automatically
managed by the scheduler (through the fallback housekeeping mask) and by
cpuset.
However, in order to preserve the preferred affinity of kthreads, cpuset
will delegate the isolated partition update propagation to the
housekeeping and kthread code.
Prepare for that by including all unbound kthreads in the managed
affinity list.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/kthread.c | 59 ++++++++++++++++++++++++------------------------
1 file changed, 30 insertions(+), 29 deletions(-)
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 24008dd9f3dc..138bb41ca916 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -366,9 +366,10 @@ static void kthread_fetch_affinity(struct kthread *kthread, struct cpumask *cpum
if (kthread->preferred_affinity) {
pref = kthread->preferred_affinity;
} else {
- if (WARN_ON_ONCE(kthread->node == NUMA_NO_NODE))
- return;
- pref = cpumask_of_node(kthread->node);
+ if (kthread->node == NUMA_NO_NODE)
+ pref = housekeeping_cpumask(HK_TYPE_KTHREAD);
+ else
+ pref = cpumask_of_node(kthread->node);
}
cpumask_and(cpumask, pref, housekeeping_cpumask(HK_TYPE_KTHREAD));
@@ -381,32 +382,29 @@ static void kthread_affine_node(void)
struct kthread *kthread = to_kthread(current);
cpumask_var_t affinity;
- WARN_ON_ONCE(kthread_is_per_cpu(current));
+ if (WARN_ON_ONCE(kthread_is_per_cpu(current)))
+ return;
- if (kthread->node == NUMA_NO_NODE) {
- housekeeping_affine(current, HK_TYPE_KTHREAD);
- } else {
- if (!zalloc_cpumask_var(&affinity, GFP_KERNEL)) {
- WARN_ON_ONCE(1);
- return;
- }
-
- mutex_lock(&kthread_affinity_lock);
- WARN_ON_ONCE(!list_empty(&kthread->affinity_node));
- list_add_tail(&kthread->affinity_node, &kthread_affinity_list);
- /*
- * The node cpumask is racy when read from kthread() but:
- * - a racing CPU going down will either fail on the subsequent
- * call to set_cpus_allowed_ptr() or be migrated to housekeepers
- * afterwards by the scheduler.
- * - a racing CPU going up will be handled by kthreads_online_cpu()
- */
- kthread_fetch_affinity(kthread, affinity);
- set_cpus_allowed_ptr(current, affinity);
- mutex_unlock(&kthread_affinity_lock);
-
- free_cpumask_var(affinity);
+ if (!zalloc_cpumask_var(&affinity, GFP_KERNEL)) {
+ WARN_ON_ONCE(1);
+ return;
}
+
+ mutex_lock(&kthread_affinity_lock);
+ WARN_ON_ONCE(!list_empty(&kthread->affinity_node));
+ list_add_tail(&kthread->affinity_node, &kthread_affinity_list);
+ /*
+ * The node cpumask is racy when read from kthread() but:
+ * - a racing CPU going down will either fail on the subsequent
+ * call to set_cpus_allowed_ptr() or be migrated to housekeepers
+ * afterwards by the scheduler.
+ * - a racing CPU going up will be handled by kthreads_online_cpu()
+ */
+ kthread_fetch_affinity(kthread, affinity);
+ set_cpus_allowed_ptr(current, affinity);
+ mutex_unlock(&kthread_affinity_lock);
+
+ free_cpumask_var(affinity);
}
static int kthread(void *_create)
@@ -924,8 +922,11 @@ static int kthreads_online_cpu(unsigned int cpu)
ret = -EINVAL;
continue;
}
- kthread_fetch_affinity(k, affinity);
- set_cpus_allowed_ptr(k->task, affinity);
+
+ if (k->preferred_affinity || k->node != NUMA_NO_NODE) {
+ kthread_fetch_affinity(k, affinity);
+ set_cpus_allowed_ptr(k->task, affinity);
+ }
}
free_cpumask_var(affinity);
--
2.48.1
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 23/27] kthread: Include kthreadd to the managed affinity list
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (21 preceding siblings ...)
2025-06-20 15:23 ` [PATCH 22/27] kthread: Include unbound kthreads in the managed affinity list Frederic Weisbecker
@ 2025-06-20 15:23 ` Frederic Weisbecker
2025-06-20 15:23 ` [PATCH 24/27] kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management Frederic Weisbecker
` (4 subsequent siblings)
27 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:23 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Marco Crivellari, Michal Hocko,
Peter Zijlstra, Tejun Heo, Thomas Gleixner, Vlastimil Babka,
Waiman Long
The unbound kthreads affinity management performed by cpuset is going to
be imported into the kthread core code for consolidation purposes.
Treat kthreadd just like any other kthread.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/kthread.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 138bb41ca916..4aeb09be29f0 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -821,12 +821,13 @@ int kthreadd(void *unused)
/* Setup a clean context for our children to inherit. */
set_task_comm(tsk, comm);
ignore_signals(tsk);
- set_cpus_allowed_ptr(tsk, housekeeping_cpumask(HK_TYPE_KTHREAD));
set_mems_allowed(node_states[N_MEMORY]);
current->flags |= PF_NOFREEZE;
cgroup_init_kthreadd();
+ kthread_affine_node();
+
for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
if (list_empty(&kthread_create_list))
--
2.48.1
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 24/27] kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (22 preceding siblings ...)
2025-06-20 15:23 ` [PATCH 23/27] kthread: Include kthreadd to " Frederic Weisbecker
@ 2025-06-20 15:23 ` Frederic Weisbecker
2025-06-20 15:23 ` [PATCH 25/27] sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN Frederic Weisbecker
` (3 subsequent siblings)
27 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:23 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Marco Crivellari, Michal Hocko,
Peter Zijlstra, Tejun Heo, Thomas Gleixner, Vlastimil Babka,
Waiman Long
Unbound kthreads want to run neither on nohz_full CPUs nor on domain
isolated CPUs. And since nohz_full implies domain isolation, checking
the latter is enough to verify both.
Therefore exclude kthreads from domain isolated CPUs.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/kthread.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 4aeb09be29f0..42cd6e119335 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -363,18 +363,20 @@ static void kthread_fetch_affinity(struct kthread *kthread, struct cpumask *cpum
{
const struct cpumask *pref;
+ guard(rcu)();
+
if (kthread->preferred_affinity) {
pref = kthread->preferred_affinity;
} else {
if (kthread->node == NUMA_NO_NODE)
- pref = housekeeping_cpumask(HK_TYPE_KTHREAD);
+ pref = housekeeping_cpumask(HK_TYPE_DOMAIN);
else
pref = cpumask_of_node(kthread->node);
}
- cpumask_and(cpumask, pref, housekeeping_cpumask(HK_TYPE_KTHREAD));
+ cpumask_and(cpumask, pref, housekeeping_cpumask(HK_TYPE_DOMAIN));
if (cpumask_empty(cpumask))
- cpumask_copy(cpumask, housekeeping_cpumask(HK_TYPE_KTHREAD));
+ cpumask_copy(cpumask, housekeeping_cpumask(HK_TYPE_DOMAIN));
}
static void kthread_affine_node(void)
--
2.48.1
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 25/27] sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (23 preceding siblings ...)
2025-06-20 15:23 ` [PATCH 24/27] kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management Frederic Weisbecker
@ 2025-06-20 15:23 ` Frederic Weisbecker
2025-06-20 15:23 ` [PATCH 26/27] kthread: Honour kthreads preferred affinity after cpuset changes Frederic Weisbecker
` (2 subsequent siblings)
27 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:23 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Marco Crivellari, Michal Hocko,
Peter Zijlstra, Tejun Heo, Thomas Gleixner, Vlastimil Babka,
Waiman Long
Tasks that have all their allowed CPUs offline don't want their affinity
to fall back on either nohz_full CPUs or domain isolated CPUs. And
since nohz_full implies domain isolation, checking the latter is enough
to verify both.
Therefore exclude domain isolated CPUs from the fallback task affinity.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/mmu_context.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h
index ac01dc4eb2ce..ed3dd0f3fe19 100644
--- a/include/linux/mmu_context.h
+++ b/include/linux/mmu_context.h
@@ -24,7 +24,7 @@ static inline void leave_mm(void) { }
#ifndef task_cpu_possible_mask
# define task_cpu_possible_mask(p) cpu_possible_mask
# define task_cpu_possible(cpu, p) true
-# define task_cpu_fallback_mask(p) housekeeping_cpumask(HK_TYPE_TICK)
+# define task_cpu_fallback_mask(p) housekeeping_cpumask(HK_TYPE_DOMAIN)
#else
# define task_cpu_possible(cpu, p) cpumask_test_cpu((cpu), task_cpu_possible_mask(p))
#endif
--
2.48.1
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 26/27] kthread: Honour kthreads preferred affinity after cpuset changes
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (24 preceding siblings ...)
2025-06-20 15:23 ` [PATCH 25/27] sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN Frederic Weisbecker
@ 2025-06-20 15:23 ` Frederic Weisbecker
2025-06-20 15:23 ` [PATCH 27/27] kthread: Comment on the purpose and placement of kthread_affine_node() call Frederic Weisbecker
2025-06-20 16:08 ` [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Bjorn Helgaas
27 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:23 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Ingo Molnar,
Johannes Weiner, Marco Crivellari, Michal Hocko, Peter Zijlstra,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long, cgroups
When cpuset isolated partitions get updated, unbound kthreads get
indiscriminately affined to all non-isolated CPUs, regardless of their
individual affinity preferences.
For example kswapd is a per-node kthread that prefers to be affine to
the node it refers to. Whenever an isolated partition is created,
updated or deleted, kswapd's node affinity gets broken if any CPU in
the related node remains non-isolated, because kswapd then becomes
affine to all the non-isolated CPUs instead.
Fix this by letting the consolidated kthread managed affinity code do
the affinity update on behalf of cpuset.
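With the consolidated code, the affinity applied to a per-node kthread
like kswapd upon a cpuset update boils down to something like this
(simplified sketch of the kthread_fetch_affinity() path):

	pref = cpumask_of_node(kthread->node);
	cpumask_and(cpumask, pref, housekeeping_cpumask(HK_TYPE_DOMAIN));
	if (cpumask_empty(cpumask))
		cpumask_copy(cpumask, housekeeping_cpumask(HK_TYPE_DOMAIN));

so the node preference is preserved whenever at least one CPU of the
node remains non-isolated.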
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/kthread.h | 1 +
kernel/cgroup/cpuset.c | 5 ++---
kernel/kthread.c | 38 +++++++++++++++++++++++++++++---------
kernel/sched/isolation.c | 2 ++
4 files changed, 34 insertions(+), 12 deletions(-)
diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 8d27403888ce..c92c1149ee6e 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -100,6 +100,7 @@ void kthread_unpark(struct task_struct *k);
void kthread_parkme(void);
void kthread_exit(long result) __noreturn;
void kthread_complete_and_exit(struct completion *, long) __noreturn;
+int kthreads_update_housekeeping(void);
int kthreadd(void *unused);
extern struct task_struct *kthreadd_task;
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index db80e72681ed..99ee187d941b 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1130,11 +1130,10 @@ void cpuset_update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
if (top_cs) {
/*
+ * PF_KTHREAD tasks are handled by housekeeping.
* PF_NO_SETAFFINITY tasks are ignored.
- * All per cpu kthreads should have PF_NO_SETAFFINITY
- * flag set, see kthread_set_per_cpu().
*/
- if (task->flags & PF_NO_SETAFFINITY)
+ if (task->flags & (PF_KTHREAD | PF_NO_SETAFFINITY))
continue;
cpumask_andnot(new_cpus, possible_mask, subpartitions_cpus);
} else {
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 42cd6e119335..8c1268c2cee9 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -896,14 +896,7 @@ int kthread_affine_preferred(struct task_struct *p, const struct cpumask *mask)
return ret;
}
-/*
- * Re-affine kthreads according to their preferences
- * and the newly online CPU. The CPU down part is handled
- * by select_fallback_rq() which default re-affines to
- * housekeepers from other nodes in case the preferred
- * affinity doesn't apply anymore.
- */
-static int kthreads_online_cpu(unsigned int cpu)
+static int kthreads_update_affinity(bool force)
{
cpumask_var_t affinity;
struct kthread *k;
@@ -926,7 +919,7 @@ static int kthreads_online_cpu(unsigned int cpu)
continue;
}
- if (k->preferred_affinity || k->node != NUMA_NO_NODE) {
+ if (force || k->preferred_affinity || k->node != NUMA_NO_NODE) {
kthread_fetch_affinity(k, affinity);
set_cpus_allowed_ptr(k->task, affinity);
}
@@ -937,6 +930,33 @@ static int kthreads_online_cpu(unsigned int cpu)
return ret;
}
+/**
+ * kthreads_update_housekeeping - Update kthreads affinity on cpuset change
+ *
+ * When cpuset changes a partition type to/from "isolated" or updates related
+ * cpumasks, propagate the housekeeping cpumask change to preferred kthreads
+ * affinity.
+ *
+ * Returns 0 if successful, -ENOMEM if temporary mask couldn't
+ * be allocated or -EINVAL in case of internal error.
+ */
+int kthreads_update_housekeeping(void)
+{
+ return kthreads_update_affinity(true);
+}
+
+/*
+ * Re-affine kthreads according to their preferences
+ * and the newly online CPU. The CPU down part is handled
+ * by select_fallback_rq() which default re-affines to
+ * housekeepers from other nodes in case the preferred
+ * affinity doesn't apply anymore.
+ */
+static int kthreads_online_cpu(unsigned int cpu)
+{
+ return kthreads_update_affinity(false);
+}
+
static int kthreads_init(void)
{
return cpuhp_setup_state(CPUHP_AP_KTHREADS_ONLINE, "kthreads:online",
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index e4e4fcd4cb2c..2750b80a5511 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -144,6 +144,8 @@ int housekeeping_update(struct cpumask *mask, enum hk_type type)
mem_cgroup_flush_workqueue();
vmstat_flush_workqueue();
err = workqueue_unbound_exclude_cpumask(housekeeping_cpumask(type));
+ WARN_ON_ONCE(err < 0);
+ err = kthreads_update_housekeeping();
kfree(old);
--
2.48.1
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 27/27] kthread: Comment on the purpose and placement of kthread_affine_node() call
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (25 preceding siblings ...)
2025-06-20 15:23 ` [PATCH 26/27] kthread: Honour kthreads preferred affinity after cpuset changes Frederic Weisbecker
@ 2025-06-20 15:23 ` Frederic Weisbecker
2025-06-20 16:08 ` [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Bjorn Helgaas
27 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-20 15:23 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Marco Crivellari, Michal Hocko,
Peter Zijlstra, Tejun Heo, Thomas Gleixner, Vlastimil Babka,
Waiman Long
It may not appear obvious why kthread_affine_node() is not called before
the kthread creation completion instead of after the first wake-up.
The reason is that kthread_affine_node() applies a default affinity
behaviour that only takes place if no affinity preference has already
been passed by the kthread creation call site.
Add a comment to clarify that.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/kthread.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 8c1268c2cee9..85e29b250107 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -454,6 +454,10 @@ static int kthread(void *_create)
self->started = 1;
+ /*
+ * Apply default node affinity if no call to kthread_bind[_mask]() nor
+ * kthread_affine_preferred() was issued before the first wake-up.
+ */
if (!(current->flags & PF_NO_SETAFFINITY) && !self->preferred_affinity)
kthread_affine_node();
--
2.48.1
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 11/27] block: Protect against concurrent isolated cpuset change
2025-06-20 15:22 ` [PATCH 11/27] block: Protect against concurrent " Frederic Weisbecker
@ 2025-06-20 15:59 ` Bart Van Assche
2025-06-26 15:03 ` Frederic Weisbecker
2025-06-23 5:46 ` Christoph Hellwig
1 sibling, 1 reply; 51+ messages in thread
From: Bart Van Assche @ 2025-06-20 15:59 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Jens Axboe, Marco Crivellari, Michal Hocko, Peter Zijlstra,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
linux-block
On 6/20/25 8:22 AM, Frederic Weisbecker wrote:
> The block subsystem prevents running the workqueue to isolated CPUs,
> including those defined by cpuset isolated partitions. Since
> HK_TYPE_DOMAIN will soon contain both and be subject to runtime
> modifications, synchronize against housekeeping using the relevant lock.
>
> For full support of cpuset changes, the block subsystem may need to
> propagate changes to isolated cpumask through the workqueue in the
> future.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
> block/blk-mq.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4806b867e37d..ece3369825fe 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4237,12 +4237,16 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>
> /*
> * Rule out isolated CPUs from hctx->cpumask to avoid
> - * running block kworker on isolated CPUs
> + * running block kworker on isolated CPUs.
> + * FIXME: cpuset should propagate further changes to isolated CPUs
> + * here.
> */
> + housekeeping_lock();
> for_each_cpu(cpu, hctx->cpumask) {
> if (cpu_is_isolated(cpu))
> cpumask_clear_cpu(cpu, hctx->cpumask);
> }
> + housekeeping_unlock();
>
> /*
> * Initialize batch roundrobin counts
Isn't it expected that function names have the subsystem name as a
prefix? The function name "housekeeping_lock" is not a good name because
that name does not make it clear what subsystem that function affects.
Additionally, "housekeeping" is very vague. Please choose a better name.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (26 preceding siblings ...)
2025-06-20 15:23 ` [PATCH 27/27] kthread: Comment on the purpose and placement of kthread_affine_node() call Frederic Weisbecker
@ 2025-06-20 16:08 ` Bjorn Helgaas
2025-06-26 14:57 ` Frederic Weisbecker
27 siblings, 1 reply; 51+ messages in thread
From: Bjorn Helgaas @ 2025-06-20 16:08 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Marco Crivellari, Michal Hocko, Peter Zijlstra, Tejun Heo,
Thomas Gleixner, Vlastimil Babka, Waiman Long
On Fri, Jun 20, 2025 at 05:22:41PM +0200, Frederic Weisbecker wrote:
> The kthread code was enhanced lately to provide an infrastructure which
> manages the preferred affinity of unbound kthreads (node or custom
> cpumask) against housekeeping constraints and CPU hotplug events.
>
> One crucial missing piece is cpuset: when an isolated partition is
> created, deleted, or its CPUs updated, all the unbound kthreads in the
> top cpuset are affine to _all_ the non-isolated CPUs, possibly breaking
> their preferred affinity along the way
>
> Solve this with performing the kthreads affinity update from cpuset to
> the kthreads consolidated relevant code instead so that preferred
> affinities are honoured.
>
> The dispatch of the new cpumasks to workqueues and kthreads is performed
> by housekeeping, as per the nice Tejun's suggestion.
>
> As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
> from isolcpus= and cpuset isolated partitions. Housekeeping cpumasks are
> now modifyable with specific synchronization. A big step toward making
> nohz_full= also mutable through cpuset in the future.
Is there anything in Documentation/ that covers the "housekeeping"
feature (and isolation in general) and how to use it? I see a few
mentions in kernel-parameters.txt and kernel-per-CPU-kthreads.rst, but
they are only incidental.
Bjorn
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 03/27] PCI: Protect against concurrent change of housekeeping cpumask
2025-06-20 15:22 ` [PATCH 03/27] PCI: Protect against concurrent change of housekeeping cpumask Frederic Weisbecker
@ 2025-06-20 16:17 ` Bjorn Helgaas
2025-06-26 14:51 ` Frederic Weisbecker
0 siblings, 1 reply; 51+ messages in thread
From: Bjorn Helgaas @ 2025-06-20 16:17 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Bjorn Helgaas, Marco Crivellari, Michal Hocko,
Peter Zijlstra, Tejun Heo, Thomas Gleixner, Vlastimil Babka,
Waiman Long
On Fri, Jun 20, 2025 at 05:22:44PM +0200, Frederic Weisbecker wrote:
> HK_TYPE_DOMAIN will soon integrate cpuset isolated partitions and
> therefore be made modifyable at runtime. Synchronize against the cpumask
> update using appropriate locking.
s/modifyable/modifiable/
> Queue and wait for the PCI call to complete while holding the
> housekeeping rwsem. This way the housekeeping update side doesn't need
> to propagate its changes to PCI.
What PCI call are we waiting for? I see housekeeping_lock(), but I
assume that's doing some housekeeping-related mutual exclusion, not
waiting for PCI work.
I don't know how to use housekeeping_lock() or when it's needed. Can
you add some guidance here and at the housekeeping_lock() definition?
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
> drivers/pci/pci-driver.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 67db34fd10ee..459d211a408b 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -362,7 +362,7 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
> dev->is_probed = 1;
>
> cpu_hotplug_disable();
> -
> + housekeeping_lock();
> /*
> * Prevent nesting work_on_cpu() for the case where a Virtual Function
> * device is probed from work_on_cpu() of the Physical device.
> @@ -392,6 +392,7 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
> error = local_pci_probe(&ddi);
> out:
> dev->is_probed = 0;
> + housekeeping_unlock();
> cpu_hotplug_enable();
> return error;
> }
> --
> 2.48.1
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 05/27] memcg: Prepare to protect against concurrent isolated cpuset change
2025-06-20 15:22 ` [PATCH 05/27] memcg: Prepare to protect " Frederic Weisbecker
@ 2025-06-20 19:19 ` Shakeel Butt
0 siblings, 0 replies; 51+ messages in thread
From: Shakeel Butt @ 2025-06-20 19:19 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Andrew Morton, Johannes Weiner, Marco Crivellari,
Michal Hocko, Michal Hocko, Muchun Song, Peter Zijlstra,
Roman Gushchin, Tejun Heo, Thomas Gleixner, Vlastimil Babka,
Waiman Long
On Fri, Jun 20, 2025 at 05:22:46PM +0200, Frederic Weisbecker wrote:
> The HK_TYPE_DOMAIN housekeeping cpumask will soon be made modifyable at
> runtime. In order to synchronize against memcg workqueue to make sure
> that no asynchronous draining is pending or executing on a newly made
> isolated CPU, read-lock the housekeeping rwsem lock while targeting
> and queueing a drain work.
>
> Whenever housekeeping will update the HK_TYPE_DOMAIN cpumask, a memcg
> workqueue flush will also be issued in a further change to make sure
> that no work remains pending after a CPU had been made isolated.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 16/27] sched/isolation: Flush memcg workqueues on cpuset isolated partition change
2025-06-20 15:22 ` [PATCH 16/27] sched/isolation: Flush memcg workqueues on cpuset isolated partition change Frederic Weisbecker
@ 2025-06-20 19:30 ` Shakeel Butt
0 siblings, 0 replies; 51+ messages in thread
From: Shakeel Butt @ 2025-06-20 19:30 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Andrew Morton, Ingo Molnar, Johannes Weiner,
Marco Crivellari, Michal Hocko, Michal Hocko, Muchun Song,
Peter Zijlstra, Roman Gushchin, Tejun Heo, Thomas Gleixner,
Vlastimil Babka, Waiman Long, cgroups, linux-mm
On Fri, Jun 20, 2025 at 05:22:57PM +0200, Frederic Weisbecker wrote:
> The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime. In
> order to synchronize against the memcg workqueue and make sure that no
> asynchronous draining is still pending or executing on a newly made
> isolated CPU, the housekeeping subsystem must flush the memcg
> workqueues.
>
> However the memcg works can't be flushed easily since they are
> queued to the main per-CPU workqueue pool.
>
> Solve this by creating a memcg specific pool and by providing and using
> the appropriate flushing API.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 11/27] block: Protect against concurrent isolated cpuset change
2025-06-20 15:22 ` [PATCH 11/27] block: Protect against concurrent " Frederic Weisbecker
2025-06-20 15:59 ` Bart Van Assche
@ 2025-06-23 5:46 ` Christoph Hellwig
2025-06-26 15:33 ` Frederic Weisbecker
1 sibling, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2025-06-23 5:46 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Jens Axboe, Marco Crivellari, Michal Hocko, Peter Zijlstra,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
linux-block
On Fri, Jun 20, 2025 at 05:22:52PM +0200, Frederic Weisbecker wrote:
> + * running block kworker on isolated CPUs.
> + * FIXME: cpuset should propagate further changes to isolated CPUs
> + * here.
I have no idea what this comment means. Can you explain it, or help
fix it? Or at least send the entire series to all affected
subsystems as there's no way to review it without the context.
If nothing changes please at least avoid the overly long line.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 02/27] sched/isolation: Introduce housekeeping per-cpu rwsem
2025-06-20 15:22 ` [PATCH 02/27] sched/isolation: Introduce housekeeping per-cpu rwsem Frederic Weisbecker
@ 2025-06-23 17:34 ` Waiman Long
2025-06-23 17:39 ` Tejun Heo
` (2 more replies)
0 siblings, 3 replies; 51+ messages in thread
From: Waiman Long @ 2025-06-23 17:34 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Ingo Molnar, Marco Crivellari, Michal Hocko, Peter Zijlstra,
Tejun Heo, Thomas Gleixner, Vlastimil Babka
On 6/20/25 11:22 AM, Frederic Weisbecker wrote:
> The HK_TYPE_DOMAIN isolation cpumask, and further the
> HK_TYPE_KERNEL_NOISE cpumask will be made modifiable at runtime in the
> future.
>
> The affected subsystems will need to synchronize against those cpumask
> changes so that:
>
> * The reader get a coherent snapshot
> * The housekeeping subsystem can safely propagate a cpumask update to
> the susbsytems after it has been published.
>
> Protect against readsides that can sleep with per-cpu rwsem. Updates are
> expected to be very rare given that CPU isolation is a niche usecase and
> related cpuset setup happen only in preparation work. On the other hand
> read sides can occur in more frequent paths.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Thanks for the patch series and it certainly has some good ideas.
However I am a bit concerned about the overhead of using percpu-rwsem for
synchronization, especially when the readers have to wait for the
completion of the writer side. From my point of view, during the
transition period when new isolated CPUs are being added or old ones
being removed, the reader will either get the old CPU data or the new
one depending on the exact timing. The effect on the CPU selection may
persist for a while after the end of the critical section.
Can we just rely on RCU to make sure that it either gets the new one or
the old one, but nothing in between, without the additional overhead?
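Something like the following on the read side (untested sketch) should
then be enough to observe a coherent mask, either the old or the new one:

	rcu_read_lock();
	mask = rcu_dereference(housekeeping_cpumasks[HK_TYPE_DOMAIN]);
	/* Use mask: it is either the old or the new one, never a mix */
	rcu_read_unlock();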
My current thinking is to make use of CPU hotplug to enable better CPU
isolation. IOW, I would shut down the affected CPUs, change the
housekeeping masks and then bring them back online again. That means the
writer side will take a while to complete.
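I.e. something along these lines (hand-wavy sketch, error handling
omitted, with "newly_isolated_mask" standing for whatever cpuset
computes):

	for_each_cpu(cpu, newly_isolated_mask)
		remove_cpu(cpu);	/* drains all the users */
	/* update the housekeeping cpumasks here */
	for_each_cpu(cpu, newly_isolated_mask)
		add_cpu(cpu);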
Cheers,
Longman
> ---
> include/linux/sched/isolation.h | 7 +++++++
> kernel/sched/isolation.c | 12 ++++++++++++
> kernel/sched/sched.h | 1 +
> 3 files changed, 20 insertions(+)
>
> diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> index f98ba0d71c52..8de4f625a5c1 100644
> --- a/include/linux/sched/isolation.h
> +++ b/include/linux/sched/isolation.h
> @@ -41,6 +41,9 @@ static inline bool housekeeping_cpu(int cpu, enum hk_type type)
> return true;
> }
>
> +extern void housekeeping_lock(void);
> +extern void housekeeping_unlock(void);
> +
> extern void __init housekeeping_init(void);
>
> #else
> @@ -73,6 +76,8 @@ static inline bool housekeeping_cpu(int cpu, enum hk_type type)
> return true;
> }
>
> +static inline void housekeeping_lock(void) { }
> +static inline void housekeeping_unlock(void) { }
> static inline void housekeeping_init(void) { }
> #endif /* CONFIG_CPU_ISOLATION */
>
> @@ -84,4 +89,6 @@ static inline bool cpu_is_isolated(int cpu)
> cpuset_cpu_is_isolated(cpu);
> }
>
> +DEFINE_LOCK_GUARD_0(housekeeping, housekeeping_lock(), housekeeping_unlock())
> +
> #endif /* _LINUX_SCHED_ISOLATION_H */
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 83cec3853864..8c02eeccea3b 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -18,12 +18,24 @@ static cpumask_var_t housekeeping_cpumasks[HK_TYPE_MAX];
> unsigned long housekeeping_flags;
> EXPORT_SYMBOL_GPL(housekeeping_flags);
>
> +DEFINE_STATIC_PERCPU_RWSEM(housekeeping_pcpu_lock);
> +
> bool housekeeping_enabled(enum hk_type type)
> {
> return !!(housekeeping_flags & BIT(type));
> }
> EXPORT_SYMBOL_GPL(housekeeping_enabled);
>
> +void housekeeping_lock(void)
> +{
> + percpu_down_read(&housekeeping_pcpu_lock);
> +}
> +
> +void housekeeping_unlock(void)
> +{
> + percpu_up_read(&housekeeping_pcpu_lock);
> +}
> +
> int housekeeping_any_cpu(enum hk_type type)
> {
> int cpu;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 475bb5998295..0cdb560ef2f3 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -46,6 +46,7 @@
> #include <linux/mm.h>
> #include <linux/module.h>
> #include <linux/mutex_api.h>
> +#include <linux/percpu-rwsem.h>
> #include <linux/plist.h>
> #include <linux/poll.h>
> #include <linux/proc_fs.h>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 02/27] sched/isolation: Introduce housekeeping per-cpu rwsem
2025-06-23 17:34 ` Waiman Long
@ 2025-06-23 17:39 ` Tejun Heo
2025-06-23 17:57 ` Waiman Long
2025-06-25 12:18 ` Phil Auld
2025-06-25 14:18 ` Frederic Weisbecker
2 siblings, 1 reply; 51+ messages in thread
From: Tejun Heo @ 2025-06-23 17:39 UTC (permalink / raw)
To: Waiman Long
Cc: Frederic Weisbecker, LKML, Ingo Molnar, Marco Crivellari,
Michal Hocko, Peter Zijlstra, Thomas Gleixner, Vlastimil Babka
Hello,
On Mon, Jun 23, 2025 at 01:34:58PM -0400, Waiman Long wrote:
> On 6/20/25 11:22 AM, Frederic Weisbecker wrote:
> > The HK_TYPE_DOMAIN isolation cpumask, and further the
> > HK_TYPE_KERNEL_NOISE cpumask will be made modifiable at runtime in the
> > future.
> >
> > The affected subsystems will need to synchronize against those cpumask
> > changes so that:
> >
> > * The reader gets a coherent snapshot
> > * The housekeeping subsystem can safely propagate a cpumask update to
> > the subsystems after it has been published.
> >
> > Protect read sides that can sleep with a per-cpu rwsem. Updates are
> > expected to be very rare given that CPU isolation is a niche use case and
> > the related cpuset setup happens only in preparation work. On the other
> > hand, read sides can occur in more frequent paths.
> >
> > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
>
> Thanks for the patch series and it certainly has some good ideas. However,
> I am a bit concerned about the overhead of using a percpu-rwsem for
> synchronization, especially when the readers have to wait for the writer
> side to complete. From my point of view, during the transition period when
> new isolated CPUs are being added or old ones being removed, the reader will
> either get the old CPU data or the new one depending on the exact timing.
> The effect of the CPU selection may persist for a while after the end of the
> critical section.
>
> Can we just rely on RCU to make sure that a reader either gets the new one
> or the old one, but nothing in between, without the additional overhead?
So, I had a similar thought - ie. does this need full interlocking so that
the modification operation can wait for existing users to drain? It'd
be nice to explain that part a bit more. That said, the percpu_rwsem read
path is pretty cheap, so if that is a requirement, I doubt the overhead
difference between RCU access and percpu read locking would make a
meaningful difference.
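For reference, with the guard defined in this patch, a sleeping read side
boils down to something like the sketch below (the surrounding function is
made up for illustration):
    static void frob_housekeeping_cpus(void)
    {
            guard(housekeeping)();  /* percpu_down_read() .. percpu_up_read() */
            /* ... safely walk housekeeping_cpumask(HK_TYPE_DOMAIN) here ... */
    }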
Thanks.
--
tejun
* Re: [PATCH 02/27] sched/isolation: Introduce housekeeping per-cpu rwsem
2025-06-23 17:39 ` Tejun Heo
@ 2025-06-23 17:57 ` Waiman Long
2025-06-23 18:03 ` Tejun Heo
0 siblings, 1 reply; 51+ messages in thread
From: Waiman Long @ 2025-06-23 17:57 UTC (permalink / raw)
To: Tejun Heo, Waiman Long
Cc: Frederic Weisbecker, LKML, Ingo Molnar, Marco Crivellari,
Michal Hocko, Peter Zijlstra, Thomas Gleixner, Vlastimil Babka
On 6/23/25 1:39 PM, Tejun Heo wrote:
> Hello,
>
> On Mon, Jun 23, 2025 at 01:34:58PM -0400, Waiman Long wrote:
>> On 6/20/25 11:22 AM, Frederic Weisbecker wrote:
>>> The HK_TYPE_DOMAIN isolation cpumask, and further the
>>> HK_TYPE_KERNEL_NOISE cpumask will be made modifiable at runtime in the
>>> future.
>>>
>>> The affected subsystems will need to synchronize against those cpumask
>>> changes so that:
>>>
>>> * The reader gets a coherent snapshot
>>> * The housekeeping subsystem can safely propagate a cpumask update to
>>> the subsystems after it has been published.
>>>
>>> Protect read sides that can sleep with a per-cpu rwsem. Updates are
>>> expected to be very rare given that CPU isolation is a niche use case and
>>> the related cpuset setup happens only in preparation work. On the other
>>> hand, read sides can occur in more frequent paths.
>>>
>>> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
>> Thanks for the patch series and it certainly has some good ideas. However,
>> I am a bit concerned about the overhead of using a percpu-rwsem for
>> synchronization, especially when the readers have to wait for the writer
>> side to complete. From my point of view, during the transition period when
>> new isolated CPUs are being added or old ones being removed, the reader will
>> either get the old CPU data or the new one depending on the exact timing.
>> The effect of the CPU selection may persist for a while after the end of the
>> critical section.
>>
>> Can we just rely on RCU to make sure that a reader either gets the new one
>> or the old one, but nothing in between, without the additional overhead?
> So, I had a similar thought - ie. does this need full interlocking so that
> the modification operation can wait for existing users to drain? It'd be
> nice to explain that part a bit more. That said, the percpu_rwsem read path
> is pretty cheap, so if that is a requirement, I doubt the overhead
> difference between RCU access and percpu read locking would make a
> meaningful difference.
>
> Thanks.
The percpu-rwsem does have a cheaper read side compared with a rwsem for
the typical use case where writer updates happen sparingly. However, when
the writer has successfully acquired the write lock, the readers do have
to wait until the writer issues a percpu_up_write() call before they can
proceed. It is the delay introduced by this wait that I am worried about.
Isolated partitions are typically set up to run RT applications that
have strict latency requirements. So any possible latency spike should
be avoided.
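To make the worry concrete, the update side would presumably look like the
sketch below (not part of this patch; names assumed), and every reader
arriving inside the marked window has to wait:
    static void housekeeping_update(enum hk_type type, const struct cpumask *new)
    {
            percpu_down_write(&housekeeping_pcpu_lock);
            /* Readers arriving from here on block, even on isolated CPUs... */
            cpumask_copy(housekeeping_cpumasks[type], new);
            percpu_up_write(&housekeeping_pcpu_lock);
            /* ...until this point, which bounds the possible latency spike. */
    }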
Cheers,
Longman
* Re: [PATCH 02/27] sched/isolation: Introduce housekeeping per-cpu rwsem
2025-06-23 17:57 ` Waiman Long
@ 2025-06-23 18:03 ` Tejun Heo
2025-06-25 14:30 ` Frederic Weisbecker
0 siblings, 1 reply; 51+ messages in thread
From: Tejun Heo @ 2025-06-23 18:03 UTC (permalink / raw)
To: Waiman Long
Cc: Frederic Weisbecker, LKML, Ingo Molnar, Marco Crivellari,
Michal Hocko, Peter Zijlstra, Thomas Gleixner, Vlastimil Babka
Hello,
On Mon, Jun 23, 2025 at 01:57:17PM -0400, Waiman Long wrote:
> The percpu-rwsem does have a cheaper read side compared with a rwsem for the
> typical use case where writer updates happen sparingly. However, when the
> writer has successfully acquired the write lock, the readers do have to wait
> until the writer issues a percpu_up_write() call before they can proceed. It
> is the delay introduced by this wait that I am worried about. Isolated
> partitions are typically set up to run RT applications that have strict
> latency requirements. So any possible latency spike should be avoided.
I see. Hmm... this being the mechanism that establishes the isolation, it
doesn't seem too broken if things stutter a bit when isolation is being
updated. Let's see what Frederic says about why the strong interlocking is
needed.
Thanks.
--
tejun
* Re: [PATCH 02/27] sched/isolation: Introduce housekeeping per-cpu rwsem
2025-06-23 17:34 ` Waiman Long
2025-06-23 17:39 ` Tejun Heo
@ 2025-06-25 12:18 ` Phil Auld
2025-06-25 14:34 ` Frederic Weisbecker
2025-06-25 14:18 ` Frederic Weisbecker
2 siblings, 1 reply; 51+ messages in thread
From: Phil Auld @ 2025-06-25 12:18 UTC (permalink / raw)
To: Waiman Long
Cc: Frederic Weisbecker, LKML, Ingo Molnar, Marco Crivellari,
Michal Hocko, Peter Zijlstra, Tejun Heo, Thomas Gleixner,
Vlastimil Babka
Hi Waiman,
On Mon, Jun 23, 2025 at 01:34:58PM -0400 Waiman Long wrote:
> On 6/20/25 11:22 AM, Frederic Weisbecker wrote:
> > The HK_TYPE_DOMAIN isolation cpumask, and further the
> > HK_TYPE_KERNEL_NOISE cpumask will be made modifiable at runtime in the
> > future.
> >
> > The affected subsystems will need to synchronize against those cpumask
> > changes so that:
> >
> > * The reader gets a coherent snapshot
> > * The housekeeping subsystem can safely propagate a cpumask update to
> > the subsystems after it has been published.
> >
> > Protect read sides that can sleep with a per-cpu rwsem. Updates are
> > expected to be very rare given that CPU isolation is a niche use case and
> > the related cpuset setup happens only in preparation work. On the other
> > hand, read sides can occur in more frequent paths.
> >
> > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
>
> Thanks for the patch series and it certainly has some good ideas. However,
> I am a bit concerned about the overhead of using a percpu-rwsem for
> synchronization, especially when the readers have to wait for the writer
> side to complete. From my point of view, during the transition period when
> new isolated CPUs are being added or old ones being removed, the reader will
> either get the old CPU data or the new one depending on the exact timing.
> The effect of the CPU selection may persist for a while after the end of the
> critical section.
>
> Can we just rely on RCU to make sure that a reader either gets the new one
> or the old one, but nothing in between, without the additional overhead?
>
> My current thinking is to make use of CPU hotplug to enable better CPU
> isolation. IOW, I would shut down the affected CPUs, change the housekeeping
> masks and then bring them back online again. That means the writer side will
> take a while to complete.
The problem with this approach is that offlining a CPU affects all the other
CPUs and causes latency spikes on other low-latency tasks which may already
be running on other parts of the system.
I just don't want us to finally get to dynamic isolation and have it not
usable for the use cases asking for it.
Cheers,
Phil
--
* Re: [PATCH 02/27] sched/isolation: Introduce housekeeping per-cpu rwsem
2025-06-23 17:34 ` Waiman Long
2025-06-23 17:39 ` Tejun Heo
2025-06-25 12:18 ` Phil Auld
@ 2025-06-25 14:18 ` Frederic Weisbecker
2025-06-26 23:58 ` Waiman Long
2 siblings, 1 reply; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-25 14:18 UTC (permalink / raw)
To: Waiman Long
Cc: LKML, Ingo Molnar, Marco Crivellari, Michal Hocko, Peter Zijlstra,
Tejun Heo, Thomas Gleixner, Vlastimil Babka
Le Mon, Jun 23, 2025 at 01:34:58PM -0400, Waiman Long a écrit :
> On 6/20/25 11:22 AM, Frederic Weisbecker wrote:
> > The HK_TYPE_DOMAIN isolation cpumask, and further the
> > HK_TYPE_KERNEL_NOISE cpumask will be made modifiable at runtime in the
> > future.
> >
> > The affected subsystems will need to synchronize against those cpumask
> > changes so that:
> >
> > * The reader gets a coherent snapshot
> > * The housekeeping subsystem can safely propagate a cpumask update to
> > the subsystems after it has been published.
> >
> > Protect read sides that can sleep with a per-cpu rwsem. Updates are
> > expected to be very rare given that CPU isolation is a niche use case and
> > the related cpuset setup happens only in preparation work. On the other
> > hand, read sides can occur in more frequent paths.
> >
> > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
>
> Thanks for the patch series and it certainly has some good ideas. However,
> I am a bit concerned about the overhead of using a percpu-rwsem for
> synchronization, especially when the readers have to wait for the writer
> side to complete. From my point of view, during the transition period when
> new isolated CPUs are being added or old ones being removed, the reader will
> either get the old CPU data or the new one depending on the exact timing.
> The effect of the CPU selection may persist for a while after the end of the
> critical section.
It depends.
1) If the read side queues a work and waits for it
(the case of work_on_cpu()), we can protect the whole operation under the
same sleeping lock and there is no persistence beyond it (sketched below).
2) But if the read side just queues some work or defines some cpumask
for a future queue, then there is persistence and some action must be
taken by housekeeping after the update to propagate the new cpumask
(flush pending works, etc...)
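A sketch of case 1), modelled loosely on the PCI patch in this series
(simplified; local_pci_probe() and ddi are as in drivers/pci/pci-driver.c):
    housekeeping_lock();
    cpu = cpumask_any_and(cpumask_of_node(node),
                          housekeeping_cpumask(HK_TYPE_DOMAIN));
    error = work_on_cpu(cpu, local_pci_probe, &ddi); /* queue *and* wait */
    housekeeping_unlock();
    /* Nothing persists past the unlock: the work has already completed. */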
> Can we just rely on RCU to make sure that a reader either gets the new one
> or the old one, but nothing in between, without the additional overhead?
This is the case as well and it is covered by 2) above.
The sleeping parts handled in 1) would require more thought.
> My current thinking is to make use of CPU hotplug to enable better CPU
> isolation. IOW, I would shut down the affected CPUs, change the housekeeping
> masks and then bring them back online again. That means the writer side will
> take a while to complete.
You mean that an isolated partition should only be set on offline CPUs? That's
the plan for nohz_full but it may be too late for domain isolation.
Thanks.
--
Frederic Weisbecker
SUSE Labs
* Re: [PATCH 02/27] sched/isolation: Introduce housekeeping per-cpu rwsem
2025-06-23 18:03 ` Tejun Heo
@ 2025-06-25 14:30 ` Frederic Weisbecker
0 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-25 14:30 UTC (permalink / raw)
To: Tejun Heo
Cc: Waiman Long, LKML, Ingo Molnar, Marco Crivellari, Michal Hocko,
Peter Zijlstra, Thomas Gleixner, Vlastimil Babka
Le Mon, Jun 23, 2025 at 08:03:46AM -1000, Tejun Heo a écrit :
> Hello,
>
> On Mon, Jun 23, 2025 at 01:57:17PM -0400, Waiman Long wrote:
> > The percpu-rwsem does have a cheaper read side compared with a rwsem for the
> > typical use case where writer updates happen sparingly. However, when the
> > writer has successfully acquired the write lock, the readers do have to wait
> > until the writer issues a percpu_up_write() call before they can proceed. It
> > is the delay introduced by this wait that I am worried about. Isolated
> > partitions are typically set up to run RT applications that have strict
> > latency requirements. So any possible latency spike should be avoided.
>
> I see. Hmm... this being the mechanism that establishes the isolation, it
> doesn't seem too broken if things stutter a bit when isolation is being
> updated. Let's see what Frederic says about why the strong interlocking is
> needed.
I should be able to work around that.
I think only PCI requires that rwsem because it relies on work_on_cpu().
I can create a dedicated workqueue for it that housekeeping can flush after
the cpumask update.
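Roughly (an illustrative sketch only, names made up):
    static struct workqueue_struct *pci_probe_wq;   /* hypothetical */
    /* PCI probe side: no rwsem, just queue on the elected CPU and wait: */
    queue_work_on(cpu, pci_probe_wq, &w.work);
    flush_work(&w.work);
    /* Housekeeping update side, after publishing the new cpumask: */
    flush_workqueue(pci_probe_wq);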
--
Frederic Weisbecker
SUSE Labs
* Re: [PATCH 02/27] sched/isolation: Introduce housekeeping per-cpu rwsem
2025-06-25 12:18 ` Phil Auld
@ 2025-06-25 14:34 ` Frederic Weisbecker
2025-06-25 15:50 ` Phil Auld
0 siblings, 1 reply; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-25 14:34 UTC (permalink / raw)
To: Phil Auld
Cc: Waiman Long, LKML, Ingo Molnar, Marco Crivellari, Michal Hocko,
Peter Zijlstra, Tejun Heo, Thomas Gleixner, Vlastimil Babka
Le Wed, Jun 25, 2025 at 08:18:50AM -0400, Phil Auld a écrit :
> Hi Waiman,
>
> On Mon, Jun 23, 2025 at 01:34:58PM -0400 Waiman Long wrote:
> > On 6/20/25 11:22 AM, Frederic Weisbecker wrote:
> > > The HK_TYPE_DOMAIN isolation cpumask, and further the
> > > HK_TYPE_KERNEL_NOISE cpumask will be made modifiable at runtime in the
> > > future.
> > >
> > > The affected subsystems will need to synchronize against those cpumask
> > > changes so that:
> > >
> > > * The reader gets a coherent snapshot
> > > * The housekeeping subsystem can safely propagate a cpumask update to
> > > the subsystems after it has been published.
> > >
> > > Protect read sides that can sleep with a per-cpu rwsem. Updates are
> > > expected to be very rare given that CPU isolation is a niche use case and
> > > the related cpuset setup happens only in preparation work. On the other
> > > hand, read sides can occur in more frequent paths.
> > >
> > > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> >
> > Thanks for the patch series and it certainly has some good ideas. However,
> > I am a bit concerned about the overhead of using a percpu-rwsem for
> > synchronization, especially when the readers have to wait for the writer
> > side to complete. From my point of view, during the transition period when
> > new isolated CPUs are being added or old ones being removed, the reader will
> > either get the old CPU data or the new one depending on the exact timing.
> > The effect of the CPU selection may persist for a while after the end of the
> > critical section.
> >
> > Can we just rely on RCU to make sure that a reader either gets the new one
> > or the old one, but nothing in between, without the additional overhead?
> >
> > My current thinking is to make use of CPU hotplug to enable better CPU
> > isolation. IOW, I would shut down the affected CPUs, change the housekeeping
> > masks and then bring them back online again. That means the writer side will
> > take a while to complete.
>
> The problem with this approach is that offlining a CPU affects all the other
> CPUs and causes latency spikes on other low-latency tasks which may already
> be running on other parts of the system.
>
> I just don't want us to finally get to dynamic isolation and have it not
> usable for the use cases asking for it.
We'll have to discuss that eventually because that's the plan for nohz_full.
We can work around the stop machine rendez-vous on nohz_full if that's the
problem. If the issue is not to interrupt common RT-tasks, then that's a
different problem for which I don't have a solution.
Thanks.
--
Frederic Weisbecker
SUSE Labs
* Re: [PATCH 02/27] sched/isolation: Introduce housekeeping per-cpu rwsem
2025-06-25 14:34 ` Frederic Weisbecker
@ 2025-06-25 15:50 ` Phil Auld
2025-06-27 0:11 ` Waiman Long
0 siblings, 1 reply; 51+ messages in thread
From: Phil Auld @ 2025-06-25 15:50 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Waiman Long, LKML, Ingo Molnar, Marco Crivellari, Michal Hocko,
Peter Zijlstra, Tejun Heo, Thomas Gleixner, Vlastimil Babka
On Wed, Jun 25, 2025 at 04:34:18PM +0200 Frederic Weisbecker wrote:
> Le Wed, Jun 25, 2025 at 08:18:50AM -0400, Phil Auld a écrit :
> > Hi Waiman,
> >
> > On Mon, Jun 23, 2025 at 01:34:58PM -0400 Waiman Long wrote:
> > > On 6/20/25 11:22 AM, Frederic Weisbecker wrote:
> > > > The HK_TYPE_DOMAIN isolation cpumask, and further the
> > > > HK_TYPE_KERNEL_NOISE cpumask will be made modifiable at runtime in the
> > > > future.
> > > >
> > > > The affected subsystems will need to synchronize against those cpumask
> > > > changes so that:
> > > >
> > > > * The reader gets a coherent snapshot
> > > > * The housekeeping subsystem can safely propagate a cpumask update to
> > > > the subsystems after it has been published.
> > > >
> > > > Protect read sides that can sleep with a per-cpu rwsem. Updates are
> > > > expected to be very rare given that CPU isolation is a niche use case and
> > > > the related cpuset setup happens only in preparation work. On the other
> > > > hand, read sides can occur in more frequent paths.
> > > >
> > > > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> > >
> > > Thanks for the patch series and it certainly has some good ideas. However,
> > > I am a bit concerned about the overhead of using a percpu-rwsem for
> > > synchronization, especially when the readers have to wait for the writer
> > > side to complete. From my point of view, during the transition period when
> > > new isolated CPUs are being added or old ones being removed, the reader will
> > > either get the old CPU data or the new one depending on the exact timing.
> > > The effect of the CPU selection may persist for a while after the end of the
> > > critical section.
> > >
> > > Can we just rely on RCU to make sure that a reader either gets the new one
> > > or the old one, but nothing in between, without the additional overhead?
> > >
> > > My current thinking is to make use of CPU hotplug to enable better CPU
> > > isolation. IOW, I would shut down the affected CPUs, change the housekeeping
> > > masks and then bring them back online again. That means the writer side will
> > > take a while to complete.
> >
> > The problem with this approach is that offlining a CPU affects all the other
> > CPUs and causes latency spikes on other low-latency tasks which may already
> > be running on other parts of the system.
> >
> > I just don't want us to finally get to dynamic isolation and have it not
> > usable for the use cases asking for it.
>
> We'll have to discuss that eventually because that's the plan for nohz_full.
> We can work around the stop machine rendez-vous on nohz_full if that's the
> problem. If the issue is not to interrupt common RT-tasks, then that's a
> different problem for which I don't have a solution.
>
My understanding is that it's the stop machine issue. If you have a way
around that then great!
Cheers,
Phil
--
* Re: [PATCH 03/27] PCI: Protect against concurrent change of housekeeping cpumask
2025-06-20 16:17 ` Bjorn Helgaas
@ 2025-06-26 14:51 ` Frederic Weisbecker
0 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-26 14:51 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: LKML, Bjorn Helgaas, Marco Crivellari, Michal Hocko,
Peter Zijlstra, Tejun Heo, Thomas Gleixner, Vlastimil Babka,
Waiman Long
Le Fri, Jun 20, 2025 at 11:17:10AM -0500, Bjorn Helgaas a écrit :
> On Fri, Jun 20, 2025 at 05:22:44PM +0200, Frederic Weisbecker wrote:
> > HK_TYPE_DOMAIN will soon integrate cpuset isolated partitions and
> > therefore be made modifyable at runtime. Synchronize against the cpumask
> > update using appropriate locking.
>
> s/modifyable/modifiable/
>
> > Queue and wait for the PCI call to complete while holding the
> > housekeeping rwsem. This way the housekeeping update side doesn't need
> > to propagate its changes to PCI.
>
> What PCI call are we waiting for? I see housekeeping_lock(), but I
> assume that's doing some housekeeping-related mutual exclusion, not
> waiting for PCI work.
It's waiting for the call to work_on_cpu() to complete (along with
the CPU election through housekeeping_cpumask()).
>
> I don't know how to use housekeeping_lock() or when it's needed. Can
> you add some guidance here and at the housekeeping_lock() definition?
You're right, it's missing documentation, context and guidance. I'll
try to fill that in the next iteration. Also the lock is likely going
to be replaced by RCU instead.
Thanks.
--
Frederic Weisbecker
SUSE Labs
* Re: [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity
2025-06-20 16:08 ` [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Bjorn Helgaas
@ 2025-06-26 14:57 ` Frederic Weisbecker
0 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-26 14:57 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: LKML, Marco Crivellari, Michal Hocko, Peter Zijlstra, Tejun Heo,
Thomas Gleixner, Vlastimil Babka, Waiman Long
Le Fri, Jun 20, 2025 at 11:08:47AM -0500, Bjorn Helgaas a écrit :
> On Fri, Jun 20, 2025 at 05:22:41PM +0200, Frederic Weisbecker wrote:
> > The kthread code was enhanced lately to provide an infrastructure which
> > manages the preferred affinity of unbound kthreads (node or custom
> > cpumask) against housekeeping constraints and CPU hotplug events.
> >
> > One crucial missing piece is cpuset: when an isolated partition is
> > created, deleted, or its CPUs updated, all the unbound kthreads in the
> > top cpuset are affine to _all_ the non-isolated CPUs, possibly breaking
> > their preferred affinity along the way.
> >
> > Solve this with performing the kthreads affinity update from cpuset to
> > the kthreads consolidated relevant code instead so that preferred
> > affinities are honoured.
> >
> > The dispatch of the new cpumasks to workqueues and kthreads is performed
> > by housekeeping, as per Tejun's nice suggestion.
> >
> > As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
> > from isolcpus= and cpuset isolated partitions. Housekeeping cpumasks are
> > now modifiable with specific synchronization. A big step toward making
> > nohz_full= also mutable through cpuset in the future.
>
> Is there anything in Documentation/ that covers the "housekeeping"
> feature (and isolation in general) and how to use it? I see a few
> mentions in kernel-parameters.txt and kernel-per-CPU-kthreads.rst, but
> they are only incidental.
Not yet, I'll try that for the next take.
Thanks.
--
Frederic Weisbecker
SUSE Labs
* Re: [PATCH 11/27] block: Protect against concurrent isolated cpuset change
2025-06-20 15:59 ` Bart Van Assche
@ 2025-06-26 15:03 ` Frederic Weisbecker
0 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-26 15:03 UTC (permalink / raw)
To: Bart Van Assche
Cc: LKML, Jens Axboe, Marco Crivellari, Michal Hocko, Peter Zijlstra,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
linux-block
Le Fri, Jun 20, 2025 at 08:59:58AM -0700, Bart Van Assche a écrit :
> On 6/20/25 8:22 AM, Frederic Weisbecker wrote:
> > The block subsystem prevents running the workqueue to isolated CPUs,
> > including those defined by cpuset isolated partitions. Since
> > HK_TYPE_DOMAIN will soon contain both and be subject to runtime
> > modifications, synchronize against housekeeping using the relevant lock.
> >
> > For full support of cpuset changes, the block subsystem may need to
> > propagate changes to isolated cpumask through the workqueue in the
> > future.
> >
> > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> > ---
> > block/blk-mq.c | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 4806b867e37d..ece3369825fe 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -4237,12 +4237,16 @@ static void blk_mq_map_swqueue(struct request_queue *q)
> > /*
> > * Rule out isolated CPUs from hctx->cpumask to avoid
> > - * running block kworker on isolated CPUs
> > + * running block kworker on isolated CPUs.
> > + * FIXME: cpuset should propagate further changes to isolated CPUs
> > + * here.
> > */
> > + housekeeping_lock();
> > for_each_cpu(cpu, hctx->cpumask) {
> > if (cpu_is_isolated(cpu))
> > cpumask_clear_cpu(cpu, hctx->cpumask);
> > }
> > + housekeeping_unlock();
> > /*
> > * Initialize batch roundrobin counts
>
> Isn't it expected that function names have the subsystem name as a
> prefix? The function name "housekeeping_lock" is not a good name because
> that name does not make it clear what subsystem that function affects.
> Additionally, "housekeeping" is very vague. Please choose a better name.
Perhaps. "housekeeping_" doesn't match "isolation.c" but there is
already a whole set of APIs with the housekeeping prefix.
Anyway, this will likely disappear and be replaced by RCU instead.
Thanks.
--
Frederic Weisbecker
SUSE Labs
* Re: [PATCH 11/27] block: Protect against concurrent isolated cpuset change
2025-06-23 5:46 ` Christoph Hellwig
@ 2025-06-26 15:33 ` Frederic Weisbecker
0 siblings, 0 replies; 51+ messages in thread
From: Frederic Weisbecker @ 2025-06-26 15:33 UTC (permalink / raw)
To: Christoph Hellwig
Cc: LKML, Jens Axboe, Marco Crivellari, Michal Hocko, Peter Zijlstra,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
linux-block
Le Sun, Jun 22, 2025 at 10:46:57PM -0700, Christoph Hellwig a écrit :
> On Fri, Jun 20, 2025 at 05:22:52PM +0200, Frederic Weisbecker wrote:
> > + * running block kworker on isolated CPUs.
> > + * FIXME: cpuset should propagate further changes to isolated CPUs
> > + * here.
>
> I have no idea what this comment means. Can you explain it, or help
> fixing it? Or at least send the entire series to all affected
> subsystems as there's no way to review it without the context.
>
> If nothing changes please at least avoid the overly long line.
That's definitely confusing.
I'll try to clarify that on the next iteration, or even try to fix
it myself.
Thanks.
--
Frederic Weisbecker
SUSE Labs
* Re: [PATCH 02/27] sched/isolation: Introduce housekeeping per-cpu rwsem
2025-06-25 14:18 ` Frederic Weisbecker
@ 2025-06-26 23:58 ` Waiman Long
0 siblings, 0 replies; 51+ messages in thread
From: Waiman Long @ 2025-06-26 23:58 UTC (permalink / raw)
To: Frederic Weisbecker, Waiman Long
Cc: LKML, Ingo Molnar, Marco Crivellari, Michal Hocko, Peter Zijlstra,
Tejun Heo, Thomas Gleixner, Vlastimil Babka
On 6/25/25 10:18 AM, Frederic Weisbecker wrote:
> Le Mon, Jun 23, 2025 at 01:34:58PM -0400, Waiman Long a écrit :
>> On 6/20/25 11:22 AM, Frederic Weisbecker wrote:
>>> The HK_TYPE_DOMAIN isolation cpumask, and further the
>>> HK_TYPE_KERNEL_NOISE cpumask will be made modifiable at runtime in the
>>> future.
>>>
>>> The affected subsystems will need to synchronize against those cpumask
>>> changes so that:
>>>
>>> * The reader gets a coherent snapshot
>>> * The housekeeping subsystem can safely propagate a cpumask update to
>>> the subsystems after it has been published.
>>>
>>> Protect read sides that can sleep with a per-cpu rwsem. Updates are
>>> expected to be very rare given that CPU isolation is a niche use case and
>>> the related cpuset setup happens only in preparation work. On the other
>>> hand, read sides can occur in more frequent paths.
>>>
>>> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
>> Thanks for the patch series and it certainly has some good ideas. However,
>> I am a bit concerned about the overhead of using a percpu-rwsem for
>> synchronization, especially when the readers have to wait for the writer
>> side to complete. From my point of view, during the transition period when
>> new isolated CPUs are being added or old ones being removed, the reader will
>> either get the old CPU data or the new one depending on the exact timing.
>> The effect of the CPU selection may persist for a while after the end of the
>> critical section.
> It depends.
>
> 1) If the read side queues a work and waits for it
> (the case of work_on_cpu()), we can protect the whole operation under the
> same sleeping lock and there is no persistence beyond it.
>
> 2) But if the read side just queues some work or defines some cpumask
> for a future queue, then there is persistence and some action must be
> taken by housekeeping after the update to propagate the new cpumask
> (flush pending works, etc...)
I don't mind doing actions to make sure that the cpumask is properly
propagated after changing housekeeping cpumasks. I just don't want to
introduce too much latency on the reader, which could be a
latency-sensitive task running on an isolated CPU.
I would say it should be OK to have a grace period (reusing the RCU
term) after changing the housekeeping cpumasks, during which tasks
running on the CPUs affected by the cpumask change may or may not
experience its full effect. However, we should minimize the overhead
on tasks that run on CPUs unrelated to the cpumask change ASAP.
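In other words, an RCU-style update along these lines would be fine (a
sketch, assuming the masks become RCU pointers as in patch 14 and that the
old/new mask bookkeeping exists):
    rcu_assign_pointer(housekeeping_cpumasks[type], new_mask);
    synchronize_rcu();      /* after this, no reader still sees old_mask */
    /* grace period over: propagate (flush works, reaffine kthreads, ...) */
    free_cpumask_var(old_mask);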
>> Can we just rely on RCU to make sure that a reader either gets the new one
>> or the old one, but nothing in between, without the additional overhead?
> This is the case as well and it is covered by 2) above.
> The sleeping parts handled in 1) would require more thought.
>
>> My current thinking is to make use of CPU hotplug to enable better CPU
>> isolation. IOW, I would shut down the affected CPUs, change the housekeeping
>> masks and then bring them back online again. That means the writer side will
>> take a while to complete.
> You mean that an isolated partition should only be set on offline CPUs? That's
> the plan for nohz_full but it may be too late for domain isolation.
Actually I was talking mainly about nohz_full, but we should handle
changes in the HK_TYPE_DOMAIN cpumask the same way.
Cheers,
Longman
* Re: [PATCH 02/27] sched/isolation: Introduce housekeeping per-cpu rwsem
2025-06-25 15:50 ` Phil Auld
@ 2025-06-27 0:11 ` Waiman Long
2025-06-27 0:48 ` Phil Auld
0 siblings, 1 reply; 51+ messages in thread
From: Waiman Long @ 2025-06-27 0:11 UTC (permalink / raw)
To: Phil Auld, Frederic Weisbecker
Cc: Waiman Long, LKML, Ingo Molnar, Marco Crivellari, Michal Hocko,
Peter Zijlstra, Tejun Heo, Thomas Gleixner, Vlastimil Babka
On 6/25/25 11:50 AM, Phil Auld wrote:
> On Wed, Jun 25, 2025 at 04:34:18PM +0200 Frederic Weisbecker wrote:
>> Le Wed, Jun 25, 2025 at 08:18:50AM -0400, Phil Auld a écrit :
>>> Hi Waiman,
>>>
>>> On Mon, Jun 23, 2025 at 01:34:58PM -0400 Waiman Long wrote:
>>>> On 6/20/25 11:22 AM, Frederic Weisbecker wrote:
>>>>> The HK_TYPE_DOMAIN isolation cpumask, and further the
>>>>> HK_TYPE_KERNEL_NOISE cpumask will be made modifiable at runtime in the
>>>>> future.
>>>>>
>>>>> The affected subsystems will need to synchronize against those cpumask
>>>>> changes so that:
>>>>>
>>>>> * The reader gets a coherent snapshot
>>>>> * The housekeeping subsystem can safely propagate a cpumask update to
>>>>> the subsystems after it has been published.
>>>>>
>>>>> Protect read sides that can sleep with a per-cpu rwsem. Updates are
>>>>> expected to be very rare given that CPU isolation is a niche use case and
>>>>> the related cpuset setup happens only in preparation work. On the other
>>>>> hand, read sides can occur in more frequent paths.
>>>>>
>>>>> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
>>>> Thanks for the patch series and it certainly has some good ideas. However,
>>>> I am a bit concerned about the overhead of using a percpu-rwsem for
>>>> synchronization, especially when the readers have to wait for the writer
>>>> side to complete. From my point of view, during the transition period when
>>>> new isolated CPUs are being added or old ones being removed, the reader will
>>>> either get the old CPU data or the new one depending on the exact timing.
>>>> The effect of the CPU selection may persist for a while after the end of the
>>>> critical section.
>>>>
>>>> Can we just rely on RCU to make sure that a reader either gets the new one
>>>> or the old one, but nothing in between, without the additional overhead?
>>>>
>>>> My current thinking is to make use of CPU hotplug to enable better CPU
>>>> isolation. IOW, I would shut down the affected CPUs, change the housekeeping
>>>> masks and then bring them back online again. That means the writer side will
>>>> take a while to complete.
>>> The problem with this approach is that offlining a CPU affects all the other
>>> CPUs and causes latency spikes on other low-latency tasks which may already
>>> be running on other parts of the system.
>>>
>>> I just don't want us to finally get to dynamic isolation and have it not
>>> usable for the use cases asking for it.
>> We'll have to discuss that eventually because that's the plan for nohz_full.
>> We can work around the stop machine rendez-vous on nohz_full if that's the
>> problem. If the issue is not to interrupt common RT-tasks, then that's a
>> different problem for which I don't have a solution.
>>
> My understanding is that it's the stop machine issue. If you have a way
> around that then great!
My current thinking is to just run a selected set of CPUHP teardown and
startup methods relevant to housekeeping cpumask usage, without calling
the full set from CPUHP_ONLINE to CPUHP_OFFLINE. I don't know whether it
is possible or how many additional changes will be needed to make that
possible. That would skip the CPUHP_TEARDOWN_CPU teardown method, which
is likely the cause of most of the latency spikes experienced by other
CPUs.
Cheers,
Longman
* Re: [PATCH 02/27] sched/isolation: Introduce housekeeping per-cpu rwsem
2025-06-27 0:11 ` Waiman Long
@ 2025-06-27 0:48 ` Phil Auld
2025-06-30 12:59 ` Thomas Gleixner
0 siblings, 1 reply; 51+ messages in thread
From: Phil Auld @ 2025-06-27 0:48 UTC (permalink / raw)
To: Waiman Long
Cc: Frederic Weisbecker, LKML, Ingo Molnar, Marco Crivellari,
Michal Hocko, Peter Zijlstra, Tejun Heo, Thomas Gleixner,
Vlastimil Babka
On Thu, Jun 26, 2025 at 08:11:54PM -0400 Waiman Long wrote:
> On 6/25/25 11:50 AM, Phil Auld wrote:
> > On Wed, Jun 25, 2025 at 04:34:18PM +0200 Frederic Weisbecker wrote:
> > > Le Wed, Jun 25, 2025 at 08:18:50AM -0400, Phil Auld a écrit :
> > > > Hi Waiman,
> > > >
> > > > On Mon, Jun 23, 2025 at 01:34:58PM -0400 Waiman Long wrote:
> > > > > On 6/20/25 11:22 AM, Frederic Weisbecker wrote:
> > > > > > The HK_TYPE_DOMAIN isolation cpumask, and further the
> > > > > > HK_TYPE_KERNEL_NOISE cpumask will be made modifiable at runtime in the
> > > > > > future.
> > > > > >
> > > > > > The affected subsystems will need to synchronize against those cpumask
> > > > > > changes so that:
> > > > > >
> > > > > > * The reader gets a coherent snapshot
> > > > > > * The housekeeping subsystem can safely propagate a cpumask update to
> > > > > > the subsystems after it has been published.
> > > > > >
> > > > > > Protect read sides that can sleep with a per-cpu rwsem. Updates are
> > > > > > expected to be very rare given that CPU isolation is a niche use case and
> > > > > > the related cpuset setup happens only in preparation work. On the other
> > > > > > hand, read sides can occur in more frequent paths.
> > > > > >
> > > > > > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> > > > > Thanks for the patch series and it certainly has some good ideas. However,
> > > > > I am a bit concerned about the overhead of using a percpu-rwsem for
> > > > > synchronization, especially when the readers have to wait for the writer
> > > > > side to complete. From my point of view, during the transition period when
> > > > > new isolated CPUs are being added or old ones being removed, the reader will
> > > > > either get the old CPU data or the new one depending on the exact timing.
> > > > > The effect of the CPU selection may persist for a while after the end of the
> > > > > critical section.
> > > > >
> > > > > Can we just rely on RCU to make sure that a reader either gets the new one
> > > > > or the old one, but nothing in between, without the additional overhead?
> > > > >
> > > > > My current thinking is to make use of CPU hotplug to enable better CPU
> > > > > isolation. IOW, I would shut down the affected CPUs, change the housekeeping
> > > > > masks and then bring them back online again. That means the writer side will
> > > > > take a while to complete.
> > > > The problem with this approach is that offlining a CPU affects all the other
> > > > CPUs and causes latency spikes on other low-latency tasks which may already
> > > > be running on other parts of the system.
> > > >
> > > > I just don't want us to finally get to dynamic isolation and have it not
> > > > usable for the use cases asking for it.
> > > We'll have to discuss that eventually because that's the plan for nohz_full.
> > > We can work around the stop machine rendez-vous on nohz_full if that's the
> > > problem. If the issue is not to interrupt common RT-tasks, then that's a
> > > different problem for which I don't have a solution.
> > >
> > My understanding is that it's the stop machine issue. If you have a way
> > around that then great!
>
> My current thinking is to just run a selected set of CPUHP teardown and
> startup methods relevant to housekeeping cpumask usage, without calling the
> full set from CPUHP_ONLINE to CPUHP_OFFLINE. I don't know whether it is
> possible or how many additional changes will be needed to make that
> possible. That would skip the CPUHP_TEARDOWN_CPU teardown method, which is
> likely the cause of most of the latency spikes experienced by other CPUs.
>
Yes, CPUHP_TEARDOWN_CPU is the source of the stop_machine I believe.
It'll be interesting to see if you can safely use the cpuhp machinery
selectively like that :)
Cheers,
Phil
> Cheers,
> Longman
>
--
* Re: [PATCH 02/27] sched/isolation: Introduce housekeeping per-cpu rwsem
2025-06-27 0:48 ` Phil Auld
@ 2025-06-30 12:59 ` Thomas Gleixner
0 siblings, 0 replies; 51+ messages in thread
From: Thomas Gleixner @ 2025-06-30 12:59 UTC (permalink / raw)
To: Phil Auld, Waiman Long
Cc: Frederic Weisbecker, LKML, Ingo Molnar, Marco Crivellari,
Michal Hocko, Peter Zijlstra, Tejun Heo, Vlastimil Babka
On Thu, Jun 26 2025 at 20:48, Phil Auld wrote:
> On Thu, Jun 26, 2025 at 08:11:54PM -0400 Waiman Long wrote:
>> > My understanding is that it's the stop machine issue. If you have a way
>> > around that then great!
>>
>> My current thinking is to just run a selected set of CPUHP teardown and
>> startup methods relevant to housekeeping cpumask usage, without calling the
>> full set from CPUHP_ONLINE to CPUHP_OFFLINE. I don't know whether it is
>> possible or how many additional changes will be needed to make that
>> possible. That would skip the CPUHP_TEARDOWN_CPU teardown method, which is
>> likely the cause of most of the latency spikes experienced by other CPUs.
>>
>
> Yes, CPUHP_TEARDOWN_CPU is the source of the stop_machine I believe.
Correct.
> It'll be interesting to see if you can safely use the cpuhp machinery
> selectively like that :)
It is supposed to work that way and you can exercise it from userspace
via sysfs already. If it fails, then there are bugs in hotplug callbacks
or ordering or ..., which need to be fixed anyway :)
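For example, with CONFIG_CPU_HOTPLUG_STATE_CONTROL=y (without it, only the
fully offline/online targets are accepted; the state numbers below are
illustrative, check the states file on the running kernel):
    # list the available hotplug states and their numbers
    cat /sys/devices/system/cpu/hotplug/states
    # take CPU 3 down only to a chosen intermediate state
    echo 140 > /sys/devices/system/cpu/cpu3/hotplug/target
    # bring it fully online again
    echo 233 > /sys/devices/system/cpu/cpu3/hotplug/target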
Thanks,
tglx
Thread overview: 51+ messages
2025-06-20 15:22 [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 01/27] sched/isolation: Remove housekeeping static key Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 02/27] sched/isolation: Introduce housekeeping per-cpu rwsem Frederic Weisbecker
2025-06-23 17:34 ` Waiman Long
2025-06-23 17:39 ` Tejun Heo
2025-06-23 17:57 ` Waiman Long
2025-06-23 18:03 ` Tejun Heo
2025-06-25 14:30 ` Frederic Weisbecker
2025-06-25 12:18 ` Phil Auld
2025-06-25 14:34 ` Frederic Weisbecker
2025-06-25 15:50 ` Phil Auld
2025-06-27 0:11 ` Waiman Long
2025-06-27 0:48 ` Phil Auld
2025-06-30 12:59 ` Thomas Gleixner
2025-06-25 14:18 ` Frederic Weisbecker
2025-06-26 23:58 ` Waiman Long
2025-06-20 15:22 ` [PATCH 03/27] PCI: Protect against concurrent change of housekeeping cpumask Frederic Weisbecker
2025-06-20 16:17 ` Bjorn Helgaas
2025-06-26 14:51 ` Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 04/27] cpu: Protect against concurrent isolated cpuset change Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 05/27] memcg: Prepare to protect " Frederic Weisbecker
2025-06-20 19:19 ` Shakeel Butt
2025-06-20 15:22 ` [PATCH 06/27] mm: vmstat: " Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 07/27] sched/isolation: Save boot defined domain flags Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 08/27] cpuset: Convert boot_hk_cpus to use HK_TYPE_DOMAIN_BOOT Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 09/27] driver core: cpu: Convert /sys/devices/system/cpu/isolated " Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 10/27] net: Keep ignoring isolated cpuset change Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 11/27] block: Protect against concurrent " Frederic Weisbecker
2025-06-20 15:59 ` Bart Van Assche
2025-06-26 15:03 ` Frederic Weisbecker
2025-06-23 5:46 ` Christoph Hellwig
2025-06-26 15:33 ` Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 12/27] cpu: Provide lockdep check for CPU hotplug lock write-held Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 13/27] cpuset: Provide lockdep check for cpuset lock held Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 14/27] sched/isolation: Convert housekeeping cpumasks to rcu pointers Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 15/27] cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 16/27] sched/isolation: Flush memcg workqueues on cpuset isolated partition change Frederic Weisbecker
2025-06-20 19:30 ` Shakeel Butt
2025-06-20 15:22 ` [PATCH 17/27] sched/isolation: Flush vmstat " Frederic Weisbecker
2025-06-20 15:22 ` [PATCH 18/27] cpuset: Propagate cpuset isolation update to workqueue through housekeeping Frederic Weisbecker
2025-06-20 15:23 ` [PATCH 19/27] cpuset: Remove cpuset_cpu_is_isolated() Frederic Weisbecker
2025-06-20 15:23 ` [PATCH 20/27] sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated() Frederic Weisbecker
2025-06-20 15:23 ` [PATCH 21/27] kthread: Refine naming of affinity related fields Frederic Weisbecker
2025-06-20 15:23 ` [PATCH 22/27] kthread: Include unbound kthreads in the managed affinity list Frederic Weisbecker
2025-06-20 15:23 ` [PATCH 23/27] kthread: Include kthreadd to " Frederic Weisbecker
2025-06-20 15:23 ` [PATCH 24/27] kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management Frederic Weisbecker
2025-06-20 15:23 ` [PATCH 25/27] sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN Frederic Weisbecker
2025-06-20 15:23 ` [PATCH 26/27] kthread: Honour kthreads preferred affinity after cpuset changes Frederic Weisbecker
2025-06-20 15:23 ` [PATCH 27/27] kthread: Comment on the purpose and placement of kthread_affine_node() call Frederic Weisbecker
2025-06-20 16:08 ` [PATCH 00/27] cpuset/isolation: Honour kthreads preferred affinity Bjorn Helgaas
2025-06-26 14:57 ` Frederic Weisbecker