public inbox for linux-hwmon@vger.kernel.org
* [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs
@ 2026-04-21  3:03 Waiman Long
  2026-04-21  3:03 ` [PATCH 01/23] sched/isolation: Add HK_TYPE_KERNEL_NOISE_BOOT & HK_TYPE_MANAGED_IRQ_BOOT Waiman Long
                   ` (22 more replies)
  0 siblings, 23 replies; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

The "isolcpus=domain" CPU list implied by the HK_TYPE_DOMAIN housekeeping
cpumask is dynamically updated whenever the set of cpuset isolated
partitions changes. This patch series extends the isolated partition code
to make dynamic changes to the "nohz_full" and "isolcpus=managed_irq"
CPU lists as well. These CPU lists correspond to equivalent changes in the
HK_TYPE_KERNEL_NOISE and HK_TYPE_MANAGED_IRQ housekeeping cpumasks.

To facilitate changing these CPU lists, the CPU hotplug code, which
already does a lot of the heavy lifting, is now being used. For changing
the "nohz_full" and HK_TYPE_KERNEL_NOISE cpumasks, the affected CPUs
are first torn down into offline mode. Then the housekeeping cpumask
and the corresponding cpumasks in the RCU, tick and watchdog subsystems
are modified. After that, the affected CPUs are brought online again.

It is slightly different for the "managed_irq" HK_TYPE_MANAGED_IRQ
cpumask: the cpumask is updated first, and the affected CPUs are then
torn down and brought up again to move the managed interrupts away from
the newly isolated CPUs.

Using CPU hotplug does have a drawback when multiple isolated
partitions are being managed in a system. The CPU offline process uses
the stop_machine mechanism to stop all the CPUs except the one being torn
down. This causes latency spikes on isolated CPUs in the other isolated
partitions. It is not a problem if only one isolated partition is
needed, but it is an issue we need to address in the near future.

Patches 1-7 enable runtime updates to the nohz_full/managed_irq
cpumasks in the sched/isolation, tick, RCU and watchdog subsystems.

Patches 8-17 modify various subsystems that need access to the
HK_TYPE_MANAGED_IRQ and HK_TYPE_KERNEL_NOISE cpumasks and their
aliases. As with the runtime-modifiable HK_TYPE_DOMAIN cpumask, RCU is
used to protect access to these cpumasks to avoid potential
use-after-free problems. Patch 17 updates
housekeeping_dereference_check() to print a warning when lockdep is
enabled if those housekeeping cpumasks are used without the proper lock
protection.

Patch 18 introduces a new cpuhp_offline_cb() API that shuts down a
given list of CPUs, runs a callback function and then brings those CPUs
up again while disallowing any concurrent CPU hotplug activity.

Patches 19-23 update the cpuset code, selftest and documentation to
allow changes in the isolated partition configuration to be reflected
dynamically in the housekeeping and other cpumasks.

As there is a slight overhead in enabling dynamic updates to the
nohz_full cpumask, this new nohz_full and managed_irq runtime update
feature has to be explicitly opted into by adding a nohz_full kernel
command line parameter, with or without a CPU list, to indicate a desire
to use this feature. This is also because a number of subsystems
explicitly check nohz_full at boot time to adjust their behavior, which
may not be easy to change after boot.

Waiman Long (23):
  sched/isolation: Add HK_TYPE_KERNEL_NOISE_BOOT &
    HK_TYPE_MANAGED_IRQ_BOOT
  sched/isolation: Enhance housekeeping_update() to support updating
    more than one HK cpumask
  tick/nohz: Make nohz_full parameter optional
  tick/nohz: Allow runtime changes in full dynticks CPUs
  tick: Pass timer tick job to an online HK CPU in tick_cpu_dying()
  rcu/nocbs: Allow runtime changes in RCU NOCBS cpumask
  watchdog: Sync up with runtime change of isolated CPUs
  arm64: topology: Use RCU to protect access to HK_TYPE_TICK cpumask
  workqueue: Use RCU to protect access of HK_TYPE_TIMER cpumask
  cpu: Use RCU to protect access of HK_TYPE_TIMER cpumask
  hrtimer: Use RCU to protect access of HK_TYPE_TIMER cpumask
  net: Use boot time housekeeping cpumask settings for now
  sched/core: Use RCU to protect access of HK_TYPE_KERNEL_NOISE cpumask
  hwmon/coretemp: Use RCU to protect access of HK_TYPE_MISC cpumask
  Drivers: hv: Use RCU to protect access of HK_TYPE_MANAGED_IRQ cpumask
  genirq/cpuhotplug: Use RCU to protect access of HK_TYPE_MANAGED_IRQ
    cpumask
  sched/isolation: Extend housekeeping_dereference_check() to cover
    changes in nohz_full or managed_irq cpumasks
  cpu/hotplug: Add a new cpuhp_offline_cb() API
  cgroup/cpuset: Improve check for calling housekeeping_update()
  cgroup/cpuset: Enable runtime update of
    HK_TYPE_{KERNEL_NOISE,MANAGED_IRQ} cpumasks
  cgroup/cpuset: Limit the side effect of using CPU hotplug on isolated
    partition
  cgroup/cpuset: Prevent offline_disabled CPUs from being used in
    isolated partition
  cgroup/cpuset: Documentation and kselftest updates

 Documentation/admin-guide/cgroup-v2.rst       |  35 ++-
 .../admin-guide/kernel-parameters.txt         |  15 +-
 arch/arm64/kernel/topology.c                  |  17 +-
 drivers/hv/channel_mgmt.c                     |  15 +-
 drivers/hv/vmbus_drv.c                        |   7 +-
 drivers/hwmon/coretemp.c                      |   6 +-
 include/linux/context_tracking.h              |   8 +-
 include/linux/cpuhplock.h                     |   9 +
 include/linux/nmi.h                           |   2 +
 include/linux/rcupdate.h                      |   2 +
 include/linux/sched/isolation.h               |  18 +-
 include/linux/tick.h                          |   2 +
 kernel/cgroup/cpuset-internal.h               |   1 +
 kernel/cgroup/cpuset.c                        | 292 +++++++++++++++++-
 kernel/context_tracking.c                     |  19 +-
 kernel/cpu.c                                  |  72 +++++
 kernel/irq/cpuhotplug.c                       |   1 +
 kernel/irq/manage.c                           |   1 +
 kernel/rcu/tree_nocb.h                        |  24 +-
 kernel/sched/core.c                           |   4 +-
 kernel/sched/isolation.c                      | 135 +++++---
 kernel/time/hrtimer.c                         |   4 +-
 kernel/time/tick-common.c                     |  16 +-
 kernel/time/tick-sched.c                      |  48 ++-
 kernel/watchdog.c                             |  24 ++
 kernel/workqueue.c                            |   6 +-
 net/core/net-sysfs.c                          |   2 +-
 .../selftests/cgroup/test_cpuset_prs.sh       |  70 ++++-
 28 files changed, 747 insertions(+), 108 deletions(-)

-- 
2.53.0


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 01/23] sched/isolation: Add HK_TYPE_KERNEL_NOISE_BOOT & HK_TYPE_MANAGED_IRQ_BOOT
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-21  3:03 ` [PATCH 02/23] sched/isolation: Enhance housekeeping_update() to support updating more than one HK cpumask Waiman Long
                   ` (21 subsequent siblings)
  22 siblings, 0 replies; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

Commit 4fca0e550d50 ("sched/isolation: Save boot defined
domain flags") added HK_TYPE_DOMAIN_BOOT to record the boot
time "isolcpus=domain" setting. As we are going to make the
HK_TYPE_MANAGED_IRQ and HK_TYPE_KERNEL_NOISE housekeeping cpumasks
runtime modifiable, we need additional cpumasks to record their boot
time settings to make sure that those housekeeping cpumasks will always
be a subset of their boot time equivalents.

Introduce the new HK_TYPE_KERNEL_NOISE_BOOT and HK_TYPE_MANAGED_IRQ_BOOT
housekeeping types to do that.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/sched/isolation.h | 16 ++++++++++++++--
 kernel/sched/isolation.c        | 16 +++++++++-------
 2 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index dc3975ff1b2e..d1707f121e20 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -14,10 +14,22 @@ enum hk_type {
 	 * is always a subset of HK_TYPE_DOMAIN_BOOT.
 	 */
 	HK_TYPE_DOMAIN,
-	/* Inverse of boot-time isolcpus=managed_irq argument */
-	HK_TYPE_MANAGED_IRQ,
+
 	/* Inverse of boot-time nohz_full= or isolcpus=nohz arguments */
+	HK_TYPE_KERNEL_NOISE_BOOT,
+	/*
+	 * A subset of HK_TYPE_KERNEL_NOISE_BOOT as it may exclude some
+	 * additional isolated CPUs at run time.
+	 */
 	HK_TYPE_KERNEL_NOISE,
+
+	/* Inverse of boot-time isolcpus=managed_irq argument */
+	HK_TYPE_MANAGED_IRQ_BOOT,
+	/*
+	 * A subset of HK_TYPE_MANAGED_IRQ_BOOT as it may exclude some
+	 * additional isolated CPUs at run time.
+	 */
+	HK_TYPE_MANAGED_IRQ,
 	HK_TYPE_MAX,
 
 	/*
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index a947d75b43f1..9ec9ae510dc7 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -12,10 +12,12 @@
 #include "sched.h"
 
 enum hk_flags {
-	HK_FLAG_DOMAIN_BOOT	= BIT(HK_TYPE_DOMAIN_BOOT),
-	HK_FLAG_DOMAIN		= BIT(HK_TYPE_DOMAIN),
-	HK_FLAG_MANAGED_IRQ	= BIT(HK_TYPE_MANAGED_IRQ),
-	HK_FLAG_KERNEL_NOISE	= BIT(HK_TYPE_KERNEL_NOISE),
+	HK_FLAG_DOMAIN_BOOT	  = BIT(HK_TYPE_DOMAIN_BOOT),
+	HK_FLAG_DOMAIN		  = BIT(HK_TYPE_DOMAIN),
+	HK_FLAG_KERNEL_NOISE_BOOT = BIT(HK_TYPE_KERNEL_NOISE_BOOT),
+	HK_FLAG_KERNEL_NOISE	  = BIT(HK_TYPE_KERNEL_NOISE),
+	HK_FLAG_MANAGED_IRQ_BOOT  = BIT(HK_TYPE_MANAGED_IRQ_BOOT),
+	HK_FLAG_MANAGED_IRQ	  = BIT(HK_TYPE_MANAGED_IRQ),
 };
 
 DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
@@ -315,7 +317,7 @@ static int __init housekeeping_nohz_full_setup(char *str)
 {
 	unsigned long flags;
 
-	flags = HK_FLAG_KERNEL_NOISE;
+	flags = HK_FLAG_KERNEL_NOISE | HK_FLAG_KERNEL_NOISE_BOOT;
 
 	return housekeeping_setup(str, flags);
 }
@@ -334,7 +336,7 @@ static int __init housekeeping_isolcpus_setup(char *str)
 		 */
 		if (!strncmp(str, "nohz,", 5)) {
 			str += 5;
-			flags |= HK_FLAG_KERNEL_NOISE;
+			flags |= HK_FLAG_KERNEL_NOISE | HK_FLAG_KERNEL_NOISE_BOOT;
 			continue;
 		}
 
@@ -346,7 +348,7 @@ static int __init housekeeping_isolcpus_setup(char *str)
 
 		if (!strncmp(str, "managed_irq,", 12)) {
 			str += 12;
-			flags |= HK_FLAG_MANAGED_IRQ;
+			flags |= HK_FLAG_MANAGED_IRQ | HK_FLAG_MANAGED_IRQ_BOOT;
 			continue;
 		}
 
-- 
2.53.0



* [PATCH 02/23] sched/isolation: Enhance housekeeping_update() to support updating more than one HK cpumask
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
  2026-04-21  3:03 ` [PATCH 01/23] sched/isolation: Add HK_TYPE_KERNEL_NOISE_BOOT & HK_TYPE_MANAGED_IRQ_BOOT Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-22  3:08   ` sashiko-bot
  2026-04-22  6:39   ` Chen Ridong
  2026-04-21  3:03 ` [PATCH 03/23] tick/nohz: Make nohz_full parameter optional Waiman Long
                   ` (20 subsequent siblings)
  22 siblings, 2 replies; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

The housekeeping_update() function currently allows updating the
HK_TYPE_DOMAIN cpumask only. As we are going to enable dynamic
modification of the other housekeeping cpumasks, we need to extend
it to support passing in information about which HK cpumask(s) are
to be updated.  In cases where some HK cpumasks happen to be the same,
it is more efficient to update multiple HK cpumasks in one single
call instead of calling it multiple times. Extend housekeeping_update()
to support that as well.

Also add the restriction that the isolated cpumask parameter passed
to housekeeping_update() must include all the CPUs isolated at boot
time. This is currently the case for cpuset anyway.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/sched/isolation.h |  2 +-
 kernel/cgroup/cpuset.c          |  2 +-
 kernel/sched/isolation.c        | 99 +++++++++++++++++++++++----------
 3 files changed, 71 insertions(+), 32 deletions(-)

diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index d1707f121e20..a17f16e0156e 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -51,7 +51,7 @@ extern const struct cpumask *housekeeping_cpumask(enum hk_type type);
 extern bool housekeeping_enabled(enum hk_type type);
 extern void housekeeping_affine(struct task_struct *t, enum hk_type type);
 extern bool housekeeping_test_cpu(int cpu, enum hk_type type);
-extern int housekeeping_update(struct cpumask *isol_mask);
+extern int housekeeping_update(struct cpumask *isol_mask, unsigned long flags);
 extern void __init housekeeping_init(void);
 
 #else
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 1335e437098e..a4eccb0ec0d1 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1354,7 +1354,7 @@ static void cpuset_update_sd_hk_unlock(void)
 		 */
 		mutex_unlock(&cpuset_mutex);
 		cpus_read_unlock();
-		WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus));
+		WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus, BIT(HK_TYPE_DOMAIN)));
 		mutex_unlock(&cpuset_top_mutex);
 	} else {
 		cpuset_full_unlock();
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 9ec9ae510dc7..965d6f8fe344 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -120,48 +120,87 @@ bool housekeeping_test_cpu(int cpu, enum hk_type type)
 }
 EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
 
-int housekeeping_update(struct cpumask *isol_mask)
-{
-	struct cpumask *trial, *old = NULL;
-	int err;
+/* HK type processing table */
+static struct {
+	int type;
+	int boot_type;
+} hk_types[] = {
+	{ HK_TYPE_DOMAIN,       HK_TYPE_DOMAIN_BOOT	  },
+	{ HK_TYPE_MANAGED_IRQ,  HK_TYPE_MANAGED_IRQ_BOOT  },
+	{ HK_TYPE_KERNEL_NOISE, HK_TYPE_KERNEL_NOISE_BOOT }
+};
 
-	trial = kmalloc(cpumask_size(), GFP_KERNEL);
-	if (!trial)
-		return -ENOMEM;
+#define HK_TYPE_CNT	ARRAY_SIZE(hk_types)
 
-	cpumask_andnot(trial, housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT), isol_mask);
-	if (!cpumask_intersects(trial, cpu_online_mask)) {
-		kfree(trial);
-		return -EINVAL;
+int housekeeping_update(struct cpumask *isol_mask, unsigned long flags)
+{
+	struct cpumask *trial[HK_TYPE_CNT];
+	int i, err = 0;
+
+	for (i = 0; i < HK_TYPE_CNT; i++) {
+		int type = hk_types[i].type;
+		int boot = hk_types[i].boot_type;
+
+		trial[i] = NULL;
+		if (flags & BIT(type)) {
+			trial[i] = kmalloc(cpumask_size(), GFP_KERNEL);
+			if (!trial[i]) {
+				err = -ENOMEM;
+				goto out;
+			}
+			/*
+			 * The new HK cpumask must be a subset of its boot
+			 * cpumask.
+			 */
+			cpumask_andnot(trial[i], cpu_possible_mask, isol_mask);
+			if (!cpumask_intersects(trial[i], cpu_online_mask) ||
+			    !cpumask_subset(trial[i], housekeeping_cpumask(boot))) {
+				i++;
+				err = -EINVAL;
+				goto out;
+			}
+		}
 	}
 
 	if (!housekeeping.flags)
 		static_branch_enable(&housekeeping_overridden);
 
-	if (housekeeping.flags & HK_FLAG_DOMAIN)
-		old = housekeeping_cpumask_dereference(HK_TYPE_DOMAIN);
-	else
-		WRITE_ONCE(housekeeping.flags, housekeeping.flags | HK_FLAG_DOMAIN);
-	rcu_assign_pointer(housekeeping.cpumasks[HK_TYPE_DOMAIN], trial);
-
-	synchronize_rcu();
-
-	pci_probe_flush_workqueue();
-	mem_cgroup_flush_workqueue();
-	vmstat_flush_workqueue();
+	for (i = 0; i < HK_TYPE_CNT; i++) {
+		int type = hk_types[i].type;
+		struct cpumask *old;
 
-	err = workqueue_unbound_housekeeping_update(housekeeping_cpumask(HK_TYPE_DOMAIN));
-	WARN_ON_ONCE(err < 0);
+		if (!trial[i])
+			continue;
+		old = NULL;
+		if (housekeeping.flags & BIT(type))
+			old = housekeeping_cpumask_dereference(type);
+		rcu_assign_pointer(housekeeping.cpumasks[type], trial[i]);
+		trial[i] = old;
+	}
 
-	err = tmigr_isolated_exclude_cpumask(isol_mask);
-	WARN_ON_ONCE(err < 0);
+	if ((housekeeping.flags & flags) != flags)
+		WRITE_ONCE(housekeeping.flags, housekeeping.flags | flags);
 
-	err = kthreads_update_housekeeping();
-	WARN_ON_ONCE(err < 0);
+	synchronize_rcu();
 
-	kfree(old);
+	if (flags & HK_FLAG_DOMAIN) {
+		/*
+		 * HK_TYPE_DOMAIN specific callbacks
+		 */
+		pci_probe_flush_workqueue();
+		mem_cgroup_flush_workqueue();
+		vmstat_flush_workqueue();
+
+		WARN_ON_ONCE(workqueue_unbound_housekeeping_update(
+				housekeeping_cpumask(HK_TYPE_DOMAIN)) < 0);
+		WARN_ON_ONCE(tmigr_isolated_exclude_cpumask(isol_mask) < 0);
+		WARN_ON_ONCE(kthreads_update_housekeeping() < 0);
+	}
 
-	return 0;
+out:
+	while (--i >= 0)
+		kfree(trial[i]);
+	return err;
 }
 
 void __init housekeeping_init(void)
-- 
2.53.0



* [PATCH 03/23] tick/nohz: Make nohz_full parameter optional
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
  2026-04-21  3:03 ` [PATCH 01/23] sched/isolation: Add HK_TYPE_KERNEL_NOISE_BOOT & HK_TYPE_MANAGED_IRQ_BOOT Waiman Long
  2026-04-21  3:03 ` [PATCH 02/23] sched/isolation: Enhance housekeeping_update() to support updating more than one HK cpumask Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-21  8:32   ` Thomas Gleixner
  2026-04-22  3:08   ` sashiko-bot
  2026-04-21  3:03 ` [PATCH 04/23] tick/nohz: Allow runtime changes in full dynticks CPUs Waiman Long
                   ` (19 subsequent siblings)
  22 siblings, 2 replies; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

To provide nohz_full tick support, there is a set of tick dependency
masks that need to be evaluated on every IRQ and context switch.
Switching on nohz_full tick support at runtime would be problematic,
as some of the tick dependency masks may not be properly set, causing
problems down the road.

Allow the nohz_full boot option to be specified without a
parameter to force-enable nohz_full tick support without any
CPU in tick_nohz_full_mask yet. The context_tracking_key and
tick_nohz_full_running flag will be enabled in this case to make
tick_nohz_full_enabled() return true.

There is still a small performance overhead from force-enabling
nohz_full this way. So it should only be used if there is a chance that
some CPUs may become isolated later via the cpuset isolated partition
functionality and better CPU isolation close to nohz_full is desired.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/kernel-parameters.txt | 15 +++++++++------
 include/linux/context_tracking.h                |  7 ++++++-
 kernel/context_tracking.c                       |  4 +++-
 kernel/rcu/tree_nocb.h                          |  2 +-
 kernel/sched/isolation.c                        | 13 ++++++++++++-
 kernel/time/tick-sched.c                        | 11 +++++++++--
 6 files changed, 40 insertions(+), 12 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 95f97ce487a4..f0eedaebe9d6 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4550,13 +4550,16 @@ Kernel parameters
 			Valid arguments: on, off
 			Default: on
 
-	nohz_full=	[KNL,BOOT,SMP,ISOL]
-			The argument is a cpu list, as described above.
+	nohz_full[=cpu-list]
+			[KNL,BOOT,SMP,ISOL]
 			In kernels built with CONFIG_NO_HZ_FULL=y, set
-			the specified list of CPUs whose tick will be stopped
-			whenever possible. The boot CPU will be forced outside
-			the range to maintain the timekeeping.  Any CPUs
-			in this list will have their RCU callbacks offloaded,
+			the specified list of CPUs whose tick will be
+			stopped whenever possible.  If the argument is
+			not specified, nohz_full will be force-enabled
+			without any CPU in the nohz_full list yet.
+			The boot CPU will be forced outside the range
+			to maintain the timekeeping.  Any CPUs in this
+			list will have their RCU callbacks offloaded,
 			just as if they had also been called out in the
 			rcu_nocbs= boot parameter.
 
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index af9fe87a0922..a3fea7f9fef6 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -9,8 +9,13 @@
 
 #include <asm/ptrace.h>
 
-
 #ifdef CONFIG_CONTEXT_TRACKING_USER
+/*
+ * Pass CONTEXT_TRACKING_FORCE_ENABLE to ct_cpu_track_user() to force enable
+ * user context tracking.
+ */
+#define CONTEXT_TRACKING_FORCE_ENABLE	(-1)
+
 extern void ct_cpu_track_user(int cpu);
 
 /* Called with interrupts disabled.  */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index a743e7ffa6c0..925999de1a28 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -678,7 +678,9 @@ void __init ct_cpu_track_user(int cpu)
 {
 	static __initdata bool initialized = false;
 
-	if (!per_cpu(context_tracking.active, cpu)) {
+	if (cpu == CONTEXT_TRACKING_FORCE_ENABLE) {
+		static_branch_inc(&context_tracking_key);
+	} else if (!per_cpu(context_tracking.active, cpu)) {
 		per_cpu(context_tracking.active, cpu) = true;
 		static_branch_inc(&context_tracking_key);
 	}
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index b3337c7231cc..2d06dcb61f37 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -1267,7 +1267,7 @@ void __init rcu_init_nohz(void)
 	struct shrinker * __maybe_unused lazy_rcu_shrinker;
 
 #if defined(CONFIG_NO_HZ_FULL)
-	if (tick_nohz_full_running && !cpumask_empty(tick_nohz_full_mask))
+	if (tick_nohz_full_running)
 		cpumask = tick_nohz_full_mask;
 #endif
 
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 965d6f8fe344..c233d55a1e95 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -268,6 +268,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
 	}
 
 	alloc_bootmem_cpumask_var(&non_housekeeping_mask);
+
 	if (cpulist_parse(str, non_housekeeping_mask) < 0) {
 		pr_warn("Housekeeping: nohz_full= or isolcpus= incorrect CPU range\n");
 		goto free_non_housekeeping_mask;
@@ -277,6 +278,13 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
 	cpumask_andnot(housekeeping_staging,
 		       cpu_possible_mask, non_housekeeping_mask);
 
+	/*
+	 * Allow "nohz_full" without parameter to force enable nohz_full
+	 * at boot time without any CPUs in the nohz_full list yet.
+	 */
+	if ((flags & HK_FLAG_KERNEL_NOISE) && !*str)
+		goto setup_housekeeping_staging;
+
 	first_cpu = cpumask_first_and(cpu_present_mask, housekeeping_staging);
 	if (first_cpu >= nr_cpu_ids || first_cpu >= setup_max_cpus) {
 		__cpumask_set_cpu(smp_processor_id(), housekeeping_staging);
@@ -290,6 +298,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
 	if (cpumask_empty(non_housekeeping_mask))
 		goto free_housekeeping_staging;
 
+setup_housekeeping_staging:
 	if (!housekeeping.flags) {
 		/* First setup call ("nohz_full=" or "isolcpus=") */
 		enum hk_type type;
@@ -357,10 +366,12 @@ static int __init housekeeping_nohz_full_setup(char *str)
 	unsigned long flags;
 
 	flags = HK_FLAG_KERNEL_NOISE | HK_FLAG_KERNEL_NOISE_BOOT;
+	if (*str == '=')
+		str++;
 
 	return housekeeping_setup(str, flags);
 }
-__setup("nohz_full=", housekeeping_nohz_full_setup);
+__setup("nohz_full", housekeeping_nohz_full_setup);
 
 static int __init housekeeping_isolcpus_setup(char *str)
 {
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9e5264458414..ed877b2c9040 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -676,8 +676,15 @@ void __init tick_nohz_init(void)
 		}
 	}
 
-	for_each_cpu(cpu, tick_nohz_full_mask)
-		ct_cpu_track_user(cpu);
+	/*
+	 * Force enable context_tracking_key if tick_nohz_full_mask is empty
+	 */
+	if (cpumask_empty(tick_nohz_full_mask)) {
+		ct_cpu_track_user(CONTEXT_TRACKING_FORCE_ENABLE);
+	} else {
+		for_each_cpu(cpu, tick_nohz_full_mask)
+			ct_cpu_track_user(cpu);
+	}
 
 	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
 					"kernel/nohz:predown", NULL,
-- 
2.53.0



* [PATCH 04/23] tick/nohz: Allow runtime changes in full dynticks CPUs
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
                   ` (2 preceding siblings ...)
  2026-04-21  3:03 ` [PATCH 03/23] tick/nohz: Make nohz_full parameter optional Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-21  8:50   ` Thomas Gleixner
  2026-04-22  3:08   ` sashiko-bot
  2026-04-21  3:03 ` [PATCH 05/23] tick: Pass timer tick job to an online HK CPU in tick_cpu_dying() Waiman Long
                   ` (18 subsequent siblings)
  22 siblings, 2 replies; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

Full dynticks can only be enabled if the "nohz_full" boot option has
been specified, with or without a parameter. Any change in the list of
nohz_full CPUs has to be reflected in tick_nohz_full_mask. Introduce
a new tick_nohz_full_update_cpus() helper that can be called to update
tick_nohz_full_mask at run time. The housekeeping_update() function
is modified to call the new helper when the HK_TYPE_KERNEL_NOISE cpumask
is going to be changed.

We also need to enable CPU context tracking for those CPUs that
are in tick_nohz_full_mask. So remove __init from tick_nohz_init()
and ct_cpu_track_user() so that they can be called later when an
isolated cpuset partition is being created. The __ro_after_init
attribute is taken away from context_tracking_key as well.

Also add a new ct_cpu_untrack_user() function to reverse the action of
ct_cpu_track_user() in case we need to disable the nohz_full mode of
a CPU.

With nohz_full enabled, the boot CPU (typically CPU 0) will be the
tick CPU, which cannot be shut down easily. So the boot CPU should not
be used in an isolated cpuset partition.

With runtime modification of nohz_full CPUs, tick_do_timer_cpu can become
TICK_DO_TIMER_NONE. So remove the two TICK_DO_TIMER_NONE WARN_ON_ONCE()
checks in tick-sched.c to avoid unnecessary warnings.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/context_tracking.h |  1 +
 include/linux/tick.h             |  2 ++
 kernel/context_tracking.c        | 15 ++++++++++---
 kernel/sched/isolation.c         |  3 +++
 kernel/time/tick-sched.c         | 37 ++++++++++++++++++++++++++------
 5 files changed, 48 insertions(+), 10 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index a3fea7f9fef6..1a6b816f1ad6 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -17,6 +17,7 @@
 #define CONTEXT_TRACKING_FORCE_ENABLE	(-1)
 
 extern void ct_cpu_track_user(int cpu);
+extern void ct_cpu_untrack_user(int cpu);
 
 /* Called with interrupts disabled.  */
 extern void __ct_user_enter(enum ctx_state state);
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 738007d6f577..05586f14461c 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -274,6 +274,7 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void __tick_nohz_task_switch(void);
 extern void __init tick_nohz_full_setup(cpumask_var_t cpumask);
+extern void tick_nohz_full_update_cpus(struct cpumask *cpumask);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -299,6 +300,7 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
 static inline void tick_nohz_full_kick_cpu(int cpu) { }
 static inline void __tick_nohz_task_switch(void) { }
 static inline void tick_nohz_full_setup(cpumask_var_t cpumask) { }
+static inline void tick_nohz_full_update_cpus(struct cpumask *cpumask) { }
 #endif
 
 static inline void tick_nohz_task_switch(void)
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 925999de1a28..394e432630a3 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -411,7 +411,7 @@ static __always_inline void ct_kernel_enter(bool user, int offset) { }
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
 
-DEFINE_STATIC_KEY_FALSE_RO(context_tracking_key);
+DEFINE_STATIC_KEY_FALSE(context_tracking_key);
 EXPORT_SYMBOL_GPL(context_tracking_key);
 
 static noinstr bool context_tracking_recursion_enter(void)
@@ -674,9 +674,9 @@ void user_exit_callable(void)
 }
 NOKPROBE_SYMBOL(user_exit_callable);
 
-void __init ct_cpu_track_user(int cpu)
+void ct_cpu_track_user(int cpu)
 {
-	static __initdata bool initialized = false;
+	static bool initialized;
 
 	if (cpu == CONTEXT_TRACKING_FORCE_ENABLE) {
 		static_branch_inc(&context_tracking_key);
@@ -700,6 +700,15 @@ void __init ct_cpu_track_user(int cpu)
 	initialized = true;
 }
 
+void ct_cpu_untrack_user(int cpu)
+{
+	if (!per_cpu(context_tracking.active, cpu))
+		return;
+
+	per_cpu(context_tracking.active, cpu) = false;
+	static_branch_dec(&context_tracking_key);
+}
+
 #ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
 void __init context_tracking_init(void)
 {
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index c233d55a1e95..48b155e0b290 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -181,6 +181,9 @@ int housekeeping_update(struct cpumask *isol_mask, unsigned long flags)
 	if ((housekeeping.flags & flags) != flags)
 		WRITE_ONCE(housekeeping.flags, housekeeping.flags | flags);
 
+	if (flags & HK_FLAG_KERNEL_NOISE)
+		tick_nohz_full_update_cpus(isol_mask);
+
 	synchronize_rcu();
 
 	if (flags & HK_FLAG_DOMAIN) {
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index ed877b2c9040..7baa757ca45f 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -241,9 +241,6 @@ static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
 	tick_cpu = READ_ONCE(tick_do_timer_cpu);
 
 	if (IS_ENABLED(CONFIG_NO_HZ_COMMON) && unlikely(tick_cpu == TICK_DO_TIMER_NONE)) {
-#ifdef CONFIG_NO_HZ_FULL
-		WARN_ON_ONCE(tick_nohz_full_running);
-#endif
 		WRITE_ONCE(tick_do_timer_cpu, cpu);
 		tick_cpu = cpu;
 	}
@@ -629,6 +626,36 @@ void __init tick_nohz_full_setup(cpumask_var_t cpumask)
 	tick_nohz_full_running = true;
 }
 
+/* Take the new run-time nohz_full CPU list & update state accordingly */
+void tick_nohz_full_update_cpus(struct cpumask *cpumask)
+{
+	int cpu;
+
+	if (!tick_nohz_full_running) {
+		pr_warn_once("Full dynticks cannot be enabled without the nohz_full kernel boot parameter!\n");
+		return;
+	}
+
+	/*
+	 * To properly enable/disable nohz_full dynticks for the affected CPUs,
+	 * the new nohz_full CPUs have to be copied to tick_nohz_full_mask and
+	 * ct_cpu_track_user/ct_cpu_untrack_user() will have to be called
+	 * for those CPUs that have their states changed. Those CPUs should be
+	 * in an offline state.
+	 */
+	for_each_cpu_andnot(cpu, cpumask, tick_nohz_full_mask) {
+		WARN_ON_ONCE(cpu_online(cpu));
+		ct_cpu_track_user(cpu);
+		cpumask_set_cpu(cpu, tick_nohz_full_mask);
+	}
+
+	for_each_cpu_andnot(cpu, tick_nohz_full_mask, cpumask) {
+		WARN_ON_ONCE(cpu_online(cpu));
+		ct_cpu_untrack_user(cpu);
+		cpumask_clear_cpu(cpu, tick_nohz_full_mask);
+	}
+}
+
 bool tick_nohz_cpu_hotpluggable(unsigned int cpu)
 {
 	/*
@@ -1238,10 +1265,6 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 		 */
 		if (tick_cpu == cpu)
 			return false;
-
-		/* Should not happen for nohz-full */
-		if (WARN_ON_ONCE(tick_cpu == TICK_DO_TIMER_NONE))
-			return false;
 	}
 
 	return true;
-- 
2.53.0


* [PATCH 05/23] tick: Pass timer tick job to an online HK CPU in tick_cpu_dying()
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
                   ` (3 preceding siblings ...)
  2026-04-21  3:03 ` [PATCH 04/23] tick/nohz: Allow runtime changes in full dynticks CPUs Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-21  8:55   ` Thomas Gleixner
  2026-04-21  3:03 ` [PATCH 06/23] rcu/nocbs: Allow runtime changes in RCU NOCBS cpumask Waiman Long
                   ` (17 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

In tick_cpu_dying(), if the dying CPU is the current timekeeper,
it has to pass the job over to another CPU. The current code passes
it to another online CPU. However, that CPU may not be a timer tick
housekeeping CPU.  If that happens, another CPU will have to manually
take it over again later. Avoid this unnecessary work by directly
assigning an online housekeeping CPU.

Use READ_ONCE()/WRITE_ONCE() to access tick_do_timer_cpu in case the
non-HK CPUs are no longer guaranteed to be in stop machine in the future.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/time/tick-common.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index d305d8521896..4834a1b9044b 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -17,6 +17,7 @@
 #include <linux/profile.h>
 #include <linux/sched.h>
 #include <linux/module.h>
+#include <linux/sched/isolation.h>
 #include <trace/events/power.h>
 
 #include <asm/irq_regs.h>
@@ -394,12 +395,19 @@ int tick_cpu_dying(unsigned int dying_cpu)
 {
 	/*
 	 * If the current CPU is the timekeeper, it's the only one that can
-	 * safely hand over its duty. Also all online CPUs are in stop
-	 * machine, guaranteed not to be idle, therefore there is no
+	 * safely hand over its duty. Also all online housekeeping CPUs are
+	 * in stop machine, guaranteed not to be idle, therefore there is no
 	 * concurrency and it's safe to pick any online successor.
 	 */
-	if (tick_do_timer_cpu == dying_cpu)
-		tick_do_timer_cpu = cpumask_first(cpu_online_mask);
+	if (READ_ONCE(tick_do_timer_cpu) == dying_cpu) {
+		unsigned int new_cpu;
+
+		guard(rcu)();
+		new_cpu = cpumask_first_and(cpu_online_mask, housekeeping_cpumask(HK_TYPE_TICK));
+		if (WARN_ON_ONCE(new_cpu >= nr_cpu_ids))
+			new_cpu = cpumask_first(cpu_online_mask);
+		WRITE_ONCE(tick_do_timer_cpu, new_cpu);
+	}
 
 	/* Make sure the CPU won't try to retake the timekeeping duty */
 	tick_sched_timer_dying(dying_cpu);
-- 
2.53.0


* [PATCH 06/23] rcu/nocbs: Allow runtime changes in RCU NOCBS cpumask
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
                   ` (4 preceding siblings ...)
  2026-04-21  3:03 ` [PATCH 05/23] tick: Pass timer tick job to an online HK CPU in tick_cpu_dying() Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-22  3:08   ` sashiko-bot
  2026-04-21  3:03 ` [PATCH 07/23] watchdog: Sync up with runtime change of isolated CPUs Waiman Long
                   ` (16 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

We can make use of the rcu_nocb_cpu_offload()/rcu_nocb_cpu_deoffload()
APIs to enable RCU NO-CB CPU offloading of newly isolated CPUs and
deoffloading of de-isolated CPUs.

Add a new rcu_nocb_update_cpus() helper to do that and call it directly
from housekeeping_update() when the HK_TYPE_KERNEL_NOISE cpumask is
being changed.

This dynamic RCU NO-CB CPU offloading feature can only be used if either
the "rcu_nocbs" or the "nohz_full" kernel boot parameter is specified
(with or without an argument) so that the required RCU NO-CB resources
are properly initialized at boot time.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/rcupdate.h |  2 ++
 kernel/rcu/tree_nocb.h   | 22 ++++++++++++++++++++++
 kernel/sched/isolation.c |  4 +++-
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 04f3f86a4145..987e3d1d413e 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -150,6 +150,7 @@ void rcu_init_nohz(void);
 int rcu_nocb_cpu_offload(int cpu);
 int rcu_nocb_cpu_deoffload(int cpu);
 void rcu_nocb_flush_deferred_wakeup(void);
+void rcu_nocb_update_cpus(struct cpumask *cpumask);
 
 #define RCU_NOCB_LOCKDEP_WARN(c, s) RCU_LOCKDEP_WARN(c, s)
 
@@ -159,6 +160,7 @@ static inline void rcu_init_nohz(void) { }
 static inline int rcu_nocb_cpu_offload(int cpu) { return -EINVAL; }
 static inline int rcu_nocb_cpu_deoffload(int cpu) { return 0; }
 static inline void rcu_nocb_flush_deferred_wakeup(void) { }
+static inline void rcu_nocb_update_cpus(struct cpumask *cpumask) { }
 
 #define RCU_NOCB_LOCKDEP_WARN(c, s)
 
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 2d06dcb61f37..b2daba1e5cb9 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -1173,6 +1173,28 @@ int rcu_nocb_cpu_offload(int cpu)
 }
 EXPORT_SYMBOL_GPL(rcu_nocb_cpu_offload);
 
+void rcu_nocb_update_cpus(struct cpumask *cpumask)
+{
+	int cpu, ret;
+
+	if (!rcu_state.nocb_is_setup) {
+		pr_warn_once("Dynamic RCU NOCB cannot be enabled without nohz_full/rcu_nocbs kernel boot parameter!\n");
+		return;
+	}
+
+	for_each_cpu_andnot(cpu, cpumask, rcu_nocb_mask) {
+		ret = rcu_nocb_cpu_offload(cpu);
+		if (WARN_ON_ONCE(ret))
+			return;
+	}
+
+	for_each_cpu_andnot(cpu, rcu_nocb_mask, cpumask) {
+		ret = rcu_nocb_cpu_deoffload(cpu);
+		if (WARN_ON_ONCE(ret))
+			return;
+	}
+}
+
 #ifdef CONFIG_RCU_LAZY
 static unsigned long
 lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 48b155e0b290..b5635484ec69 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -181,8 +181,10 @@ int housekeeping_update(struct cpumask *isol_mask, unsigned long flags)
 	if ((housekeeping.flags & flags) != flags)
 		WRITE_ONCE(housekeeping.flags, housekeeping.flags | flags);
 
-	if (flags & HK_FLAG_KERNEL_NOISE)
+	if (flags & HK_FLAG_KERNEL_NOISE) {
 		tick_nohz_full_update_cpus(isol_mask);
+		rcu_nocb_update_cpus(isol_mask);
+	}
 
 	synchronize_rcu();
 
-- 
2.53.0


* [PATCH 07/23] watchdog: Sync up with runtime change of isolated CPUs
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
                   ` (5 preceding siblings ...)
  2026-04-21  3:03 ` [PATCH 06/23] rcu/nocbs: Allow runtime changes in RCU NOCBS cpumask Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-22  3:08   ` sashiko-bot
  2026-04-21  3:03 ` [PATCH 08/23] arm64: topology: Use RCU to protect access to HK_TYPE_TICK cpumask Waiman Long
                   ` (15 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

At bootup, the watchdog excludes the nohz_full CPUs specified at boot
time. As runtime changes to the nohz_full CPUs are now being enabled,
the list of CPUs with the watchdog timer running should be updated to
exclude the current set of isolated CPUs.

Add a new watchdog_cpumask_update() helper, invoked by
housekeeping_update() when the HK_TYPE_KERNEL_NOISE (HK_TYPE_TIMER)
cpumask is being updated, to refresh watchdog_cpumask and
watchdog_allowed_mask for the soft lockup detector. The cpumask updates
are done while the affected CPUs are in the offline state. When those
CPUs are brought up later, the new cpumask is used to determine whether
the hard/soft watchdog should be enabled again.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/nmi.h      |  2 ++
 kernel/sched/isolation.c |  1 +
 kernel/watchdog.c        | 24 ++++++++++++++++++++++++
 3 files changed, 27 insertions(+)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index bc1162895f35..5bf941d2b168 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -17,6 +17,7 @@
 void lockup_detector_init(void);
 void lockup_detector_retry_init(void);
 void lockup_detector_soft_poweroff(void);
+void watchdog_cpumask_update(struct cpumask *mask);
 
 extern int watchdog_user_enabled;
 extern int watchdog_thresh;
@@ -37,6 +38,7 @@ extern int sysctl_hardlockup_all_cpu_backtrace;
 static inline void lockup_detector_init(void) { }
 static inline void lockup_detector_retry_init(void) { }
 static inline void lockup_detector_soft_poweroff(void) { }
+static inline void watchdog_cpumask_update(struct cpumask *mask) { }
 #endif /* !CONFIG_LOCKUP_DETECTOR */
 
 #ifdef CONFIG_SOFTLOCKUP_DETECTOR
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index b5635484ec69..1f3f1c83dd12 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -184,6 +184,7 @@ int housekeeping_update(struct cpumask *isol_mask, unsigned long flags)
 	if (flags & HK_FLAG_KERNEL_NOISE) {
 		tick_nohz_full_update_cpus(isol_mask);
 		rcu_nocb_update_cpus(isol_mask);
+		watchdog_cpumask_update(isol_mask);
 	}
 
 	synchronize_rcu();
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 87dd5e0f6968..498c1463b843 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -1071,6 +1071,30 @@ static inline void lockup_detector_setup(void)
 }
 #endif /* !CONFIG_SOFTLOCKUP_DETECTOR */
 
+/**
+ * watchdog_cpumask_update - update watchdog_cpumask & watchdog_allowed_mask
+ * @isol_mask: cpumask of isolated CPUs
+ *
+ * Update watchdog_cpumask and watchdog_allowed_mask to be inverse of the
+ * given isolated cpumask to disable watchdog activities on isolated CPUs.
+ * It should be called while the affected CPUs are offline; they will be
+ * brought back online later.
+ *
+ * Any changes made to watchdog_cpumask by users via the sysctl parameter
+ * will be overridden. However, proc_watchdog_update() isn't called, so the
+ * change only takes effect on CPUs that are brought up later, minimizing
+ * disruption to the existing watchdog configuration.
+ */
+void watchdog_cpumask_update(struct cpumask *isol_mask)
+{
+	mutex_lock(&watchdog_mutex);
+	cpumask_andnot(&watchdog_cpumask, cpu_possible_mask, isol_mask);
+#ifdef CONFIG_SOFTLOCKUP_DETECTOR
+	cpumask_copy(&watchdog_allowed_mask, &watchdog_cpumask);
+#endif
+	mutex_unlock(&watchdog_mutex);
+}
+
 /**
  * lockup_detector_soft_poweroff - Interface to stop lockup detector(s)
  *
-- 
2.53.0


* [PATCH 08/23] arm64: topology: Use RCU to protect access to HK_TYPE_TICK cpumask
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
                   ` (6 preceding siblings ...)
  2026-04-21  3:03 ` [PATCH 07/23] watchdog: Sync up with runtime change of isolated CPUs Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-22  3:08   ` sashiko-bot
  2026-04-22  9:34   ` Chen Ridong
  2026-04-21  3:03 ` [PATCH 09/23] workqueue: Use RCU to protect access of HK_TYPE_TIMER cpumask Waiman Long
                   ` (14 subsequent siblings)
  22 siblings, 2 replies; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

As the HK_TYPE_TICK cpumask is going to be changeable at run time, we
need to use RCU to protect access to the cpumask to prevent it from
going away in the middle of the operation.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 arch/arm64/kernel/topology.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index b32f13358fbb..48f150801689 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -173,6 +173,7 @@ void arch_cpu_idle_enter(void)
 	if (!amu_fie_cpu_supported(cpu))
 		return;
 
+	guard(rcu)();
 	/* Kick in AMU update but only if one has not happened already */
 	if (housekeeping_cpu(cpu, HK_TYPE_TICK) &&
 	    time_is_before_jiffies(per_cpu(cpu_amu_samples.last_scale_update, cpu)))
@@ -187,11 +188,16 @@ int arch_freq_get_on_cpu(int cpu)
 	unsigned int start_cpu = cpu;
 	unsigned long last_update;
 	unsigned int freq = 0;
+	bool hk_cpu;
 	u64 scale;
 
 	if (!amu_fie_cpu_supported(cpu) || !arch_scale_freq_ref(cpu))
 		return -EOPNOTSUPP;
 
+	scoped_guard(rcu) {
+		hk_cpu = housekeeping_cpu(cpu, HK_TYPE_TICK);
+	}
+
 	while (1) {
 
 		amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
@@ -204,16 +210,21 @@ int arch_freq_get_on_cpu(int cpu)
 		 * (and thus freq scale), if available, for given policy: this boils
 		 * down to identifying an active cpu within the same freq domain, if any.
 		 */
-		if (!housekeeping_cpu(cpu, HK_TYPE_TICK) ||
+		if (!hk_cpu ||
 		    time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS))) {
 			struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
+			bool hk_intersects;
 			int ref_cpu;
 
 			if (!policy)
 				return -EINVAL;
 
-			if (!cpumask_intersects(policy->related_cpus,
-						housekeeping_cpumask(HK_TYPE_TICK))) {
+			scoped_guard(rcu) {
+				hk_intersects = cpumask_intersects(policy->related_cpus,
+							housekeeping_cpumask(HK_TYPE_TICK));
+			}
+
+			if (!hk_intersects) {
 				cpufreq_cpu_put(policy);
 				return -EOPNOTSUPP;
 			}
-- 
2.53.0


* [PATCH 09/23] workqueue: Use RCU to protect access of HK_TYPE_TIMER cpumask
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
                   ` (7 preceding siblings ...)
  2026-04-21  3:03 ` [PATCH 08/23] arm64: topology: Use RCU to protect access to HK_TYPE_TICK cpumask Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-21  3:03 ` [PATCH 10/23] cpu: " Waiman Long
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

As HK_TYPE_TIMER cpumask is going to be changeable at run time, use
RCU to protect access to the cpumask.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/workqueue.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 08b1c786b463..2dab3872281a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2557,8 +2557,10 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
 	if (housekeeping_enabled(HK_TYPE_TIMER)) {
 		/* If the current cpu is a housekeeping cpu, use it. */
 		cpu = smp_processor_id();
-		if (!housekeeping_test_cpu(cpu, HK_TYPE_TIMER))
-			cpu = housekeeping_any_cpu(HK_TYPE_TIMER);
+		scoped_guard(rcu) {
+			if (!housekeeping_test_cpu(cpu, HK_TYPE_TIMER))
+				cpu = housekeeping_any_cpu(HK_TYPE_TIMER);
+		}
 		add_timer_on(timer, cpu);
 	} else {
 		if (likely(cpu == WORK_CPU_UNBOUND))
-- 
2.53.0


* [PATCH 10/23] cpu: Use RCU to protect access of HK_TYPE_TIMER cpumask
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
                   ` (8 preceding siblings ...)
  2026-04-21  3:03 ` [PATCH 09/23] workqueue: Use RCU to protect access of HK_TYPE_TIMER cpumask Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-21  8:57   ` Thomas Gleixner
  2026-04-21  3:03 ` [PATCH 11/23] hrtimer: " Waiman Long
                   ` (12 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

As HK_TYPE_TIMER cpumask is going to be changeable at run time, use
RCU to protect access to the cpumask.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cpu.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index bc4f7a9ba64e..0d02b5d7a7ba 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1890,6 +1890,8 @@ int freeze_secondary_cpus(int primary)
 	cpu_maps_update_begin();
 	if (primary == -1) {
 		primary = cpumask_first(cpu_online_mask);
+
+		guard(rcu)();
 		if (!housekeeping_cpu(primary, HK_TYPE_TIMER))
 			primary = housekeeping_any_cpu(HK_TYPE_TIMER);
 	} else {
-- 
2.53.0


* [PATCH 11/23] hrtimer: Use RCU to protect access of HK_TYPE_TIMER cpumask
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
                   ` (9 preceding siblings ...)
  2026-04-21  3:03 ` [PATCH 10/23] cpu: " Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-21  8:59   ` Thomas Gleixner
  2026-04-22  3:09   ` sashiko-bot
  2026-04-21  3:03 ` [PATCH 12/23] net: Use boot time housekeeping cpumask settings for now Waiman Long
                   ` (11 subsequent siblings)
  22 siblings, 2 replies; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

As HK_TYPE_TIMER cpumask is going to be changeable at run time, use
RCU to protect access to the cpumask.

The access to the HK_TYPE_TIMER cpumask within hrtimers_cpu_dying() is
already protected, as interrupts are disabled and all the other CPUs
are stopped when this function is invoked as part of the CPU teardown
process.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/time/hrtimer.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 000fb6ba7d74..85495400a193 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -242,8 +242,10 @@ static bool hrtimer_suitable_target(struct hrtimer *timer, struct hrtimer_clock_
 static inline struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base, bool pinned)
 {
 	if (!hrtimer_base_is_online(base)) {
-		int cpu = cpumask_any_and(cpu_online_mask, housekeeping_cpumask(HK_TYPE_TIMER));
+		int cpu;
 
+		guard(rcu)();
+		cpu = cpumask_any_and(cpu_online_mask, housekeeping_cpumask(HK_TYPE_TIMER));
 		return &per_cpu(hrtimer_bases, cpu);
 	}
 
-- 
2.53.0


* [PATCH 12/23] net: Use boot time housekeeping cpumask settings for now
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
                   ` (10 preceding siblings ...)
  2026-04-21  3:03 ` [PATCH 11/23] hrtimer: " Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-21  3:03 ` [PATCH 13/23] sched/core: Use RCU to protect access of HK_TYPE_KERNEL_NOISE cpumask Waiman Long
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

Checking of the housekeeping cpumasks was originally added to the
store_rps_map() method in commit 07bbecb34106 ("net: Restrict receive
packets queuing to housekeeping CPUs"). This store method ensures
that only CPUs in housekeeping cpumask can be set in the "rps_cpus"
of the individual network devices. Later, commit 662ff1aac854 ("net:
Keep ignoring isolated cpuset change") replaced HK_TYPE_DOMAIN by
HK_TYPE_DOMAIN_BOOT as HK_TYPE_DOMAIN cpumask was going to be made
changeable at run time.

Now the nohz_full housekeeping cpumask is going to be made changeable
at run time too. Ideally, housekeeping cpumask changes should be
propagated to the RPS cpumasks, but more thought is needed on how to
properly synchronize them. For now, ignore the run time housekeeping
cpumask changes and only use the boot time "isolcpus=" and "nohz_full="
values.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 net/core/net-sysfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index e430645748a7..9e666cc3b34c 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1023,7 +1023,7 @@ int rps_cpumask_housekeeping(struct cpumask *mask)
 {
 	if (!cpumask_empty(mask)) {
 		cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT));
-		cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_WQ));
+		cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE_BOOT));
 		if (cpumask_empty(mask))
 			return -EINVAL;
 	}
-- 
2.53.0


* [PATCH 13/23] sched/core: Use RCU to protect access of HK_TYPE_KERNEL_NOISE cpumask
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
                   ` (11 preceding siblings ...)
  2026-04-21  3:03 ` [PATCH 12/23] net: Use boot time housekeeping cpumask settings for now Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-22  3:09   ` sashiko-bot
  2026-04-21  3:03 ` [PATCH 14/23] hwmon/coretemp: Use RCU to protect access of HK_TYPE_MISC cpumask Waiman Long
                   ` (9 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

As HK_TYPE_KERNEL_NOISE is going to be changeable at run time, use
RCU to protect access to the cpumask when needed. Some accesses of
the HK_TYPE_KERNEL_NOISE cpumask happen inside the tick code with
interrupts disabled, which is an implicit RCU read-side critical
section. In any case, housekeeping_cpumask() will warn if it is used
in an invalid context.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8952f5764517..6ae00c23d8a7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1261,6 +1261,8 @@ int get_nohz_timer_target(void)
 	struct sched_domain *sd;
 	const struct cpumask *hk_mask;
 
+	guard(rcu)();
+
 	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE)) {
 		if (!idle_cpu(cpu))
 			return cpu;
@@ -1269,8 +1271,6 @@ int get_nohz_timer_target(void)
 
 	hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
 
-	guard(rcu)();
-
 	for_each_domain(cpu, sd) {
 		for_each_cpu_and(i, sched_domain_span(sd), hk_mask) {
 			if (cpu == i)
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 14/23] hwmon/coretemp: Use RCU to protect access of HK_TYPE_MISC cpumask
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
                   ` (12 preceding siblings ...)
  2026-04-21  3:03 ` [PATCH 13/23] sched/core: Use RCU to protect access of HK_TYPE_KERNEL_NOISE cpumask Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-22  3:09   ` sashiko-bot
  2026-04-21  3:03 ` [PATCH 15/23] Drivers: hv: Use RCU to protect access of HK_TYPE_MANAGED_IRQ cpumask Waiman Long
                   ` (8 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

As the HK_TYPE_MISC cpumask is going to be changeable at run time,
use RCU to protect access to the cpumask.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 drivers/hwmon/coretemp.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/hwmon/coretemp.c b/drivers/hwmon/coretemp.c
index 6a0d94711ead..0284ea6a3b79 100644
--- a/drivers/hwmon/coretemp.c
+++ b/drivers/hwmon/coretemp.c
@@ -563,8 +563,10 @@ static int create_core_data(struct platform_device *pdev, unsigned int cpu,
 	u32 eax, edx;
 	int err;
 
-	if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
-		return 0;
+	scoped_guard(rcu) {
+		if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
+			return 0;
+	}
 
 	tdata = init_temp_data(pdata, cpu, pkg_flag);
 	if (!tdata)
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 15/23] Drivers: hv: Use RCU to protect access of HK_TYPE_MANAGED_IRQ cpumask
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
                   ` (13 preceding siblings ...)
  2026-04-21  3:03 ` [PATCH 14/23] hwmon/coretemp: Use RCU to protect access of HK_TYPE_MISC cpumask Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-22  3:09   ` sashiko-bot
  2026-04-21  3:03 ` [PATCH 16/23] genirq/cpuhotplug: " Waiman Long
                   ` (7 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

As the HK_TYPE_MANAGED_IRQ cpumask is going to be changeable at run
time, use RCU to protect access to the cpumask. The alloc_cpumask_var()
memory allocation is done before taking the rcu_read_lock() because
that call can sleep.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 drivers/hv/channel_mgmt.c | 15 +++++++++------
 drivers/hv/vmbus_drv.c    |  7 +++++--
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 84eb0a6a0b54..44441dafed90 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -752,13 +752,16 @@ static void init_vp_index(struct vmbus_channel *channel)
 	u32 i, ncpu = num_online_cpus();
 	cpumask_var_t available_mask;
 	struct cpumask *allocated_mask;
-	const struct cpumask *hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
+	const struct cpumask *hk_mask;
 	u32 target_cpu;
 	int numa_node;
+	bool alloc_ok;
 
-	if (!perf_chn ||
-	    !alloc_cpumask_var(&available_mask, GFP_KERNEL) ||
-	    cpumask_empty(hk_mask)) {
+	alloc_ok = alloc_cpumask_var(&available_mask, GFP_KERNEL);
+	guard(rcu)();
+	hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
+
+	if (!perf_chn || !alloc_ok || cpumask_empty(hk_mask)) {
 		/*
 		 * If the channel is not a performance critical
 		 * channel, bind it to VMBUS_CONNECT_CPU.
@@ -770,7 +773,7 @@ static void init_vp_index(struct vmbus_channel *channel)
 		channel->target_cpu = VMBUS_CONNECT_CPU;
 		if (perf_chn)
 			hv_set_allocated_cpu(VMBUS_CONNECT_CPU);
-		return;
+		goto out_free;
 	}
 
 	for (i = 1; i <= ncpu + 1; i++) {
@@ -808,7 +811,7 @@ static void init_vp_index(struct vmbus_channel *channel)
 	}
 
 	channel->target_cpu = target_cpu;
-
+out_free:
 	free_cpumask_var(available_mask);
 }
 
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 3faa74e49a6b..60c7a5ac15c0 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1763,8 +1763,11 @@ int vmbus_channel_set_cpu(struct vmbus_channel *channel, u32 target_cpu)
 	if (target_cpu >= nr_cpumask_bits)
 		return -EINVAL;
 
-	if (!cpumask_test_cpu(target_cpu, housekeeping_cpumask(HK_TYPE_MANAGED_IRQ)))
-		return -EINVAL;
+	scoped_guard(rcu) {
+		if (!cpumask_test_cpu(target_cpu,
+				      housekeeping_cpumask(HK_TYPE_MANAGED_IRQ)))
+			return -EINVAL;
+	}
 
 	if (!cpu_online(target_cpu))
 		return -EINVAL;
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 16/23] genirq/cpuhotplug: Use RCU to protect access of HK_TYPE_MANAGED_IRQ cpumask
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
                   ` (14 preceding siblings ...)
  2026-04-21  3:03 ` [PATCH 15/23] Drivers: hv: Use RCU to protect access of HK_TYPE_MANAGED_IRQ cpumask Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-21  9:02   ` Thomas Gleixner
  2026-04-21  3:03 ` [PATCH 17/23] sched/isolation: Extend housekeeping_dereference_check() to cover changes in nohz_full or managed_irq cpumasks Waiman Long
                   ` (6 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

As the HK_TYPE_MANAGED_IRQ cpumask is going to be changeable at run
time, use RCU to protect access to the cpumask.

To make a new HK_TYPE_MANAGED_IRQ cpumask take effect, the following
steps are performed:

 1) Update the HK_TYPE_MANAGED_IRQ cpumask to take out the newly isolated
    CPUs and add back the de-isolated CPUs.
 2) Tear down the affected CPUs to cause irq_migrate_all_off_this_cpu()
    to be called on the affected CPUs to migrate the irqs to other
    HK_TYPE_MANAGED_IRQ housekeeping CPUs.
 3) Bring up the previously offline CPUs to invoke
    irq_affinity_online_cpu() to allow the newly de-isolated CPUs to
    be used for managed irqs.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/irq/cpuhotplug.c | 1 +
 kernel/irq/manage.c     | 1 +
 2 files changed, 2 insertions(+)

diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
index cd5689e383b0..86437c78f1f2 100644
--- a/kernel/irq/cpuhotplug.c
+++ b/kernel/irq/cpuhotplug.c
@@ -196,6 +196,7 @@ static bool hk_should_isolate(struct irq_data *data, unsigned int cpu)
 	if (!housekeeping_enabled(HK_TYPE_MANAGED_IRQ))
 		return false;
 
+	guard(rcu)();
 	hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
 	if (cpumask_subset(irq_data_get_effective_affinity_mask(data), hk_mask))
 		return false;
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 2e8072437826..8270c4de260b 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -263,6 +263,7 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask, bool
 	    housekeeping_enabled(HK_TYPE_MANAGED_IRQ)) {
 		const struct cpumask *hk_mask;
 
+		guard(rcu)();
 		hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
 
 		cpumask_and(tmp_mask, mask, hk_mask);
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 17/23] sched/isolation: Extend housekeeping_dereference_check() to cover changes in nohz_full or managed_irq cpumasks
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
                   ` (15 preceding siblings ...)
  2026-04-21  3:03 ` [PATCH 16/23] genirq/cpuhotplug: " Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-22  3:09   ` sashiko-bot
  2026-04-21  3:03 ` [PATCH 18/23] cpu/hotplug: Add a new cpuhp_offline_cb() API Waiman Long
                   ` (5 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

As we are going to make the HK_TYPE_KERNEL_NOISE and
HK_TYPE_MANAGED_IRQ housekeeping cpumasks changeable at run time,
extend housekeeping_dereference_check() to cover changes to those
cpumasks as well.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/isolation.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 1f3f1c83dd12..1647e7b08bac 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -38,7 +38,8 @@ EXPORT_SYMBOL_GPL(housekeeping_enabled);
 
 static bool housekeeping_dereference_check(enum hk_type type)
 {
-	if (IS_ENABLED(CONFIG_LOCKDEP) && type == HK_TYPE_DOMAIN) {
+	if (IS_ENABLED(CONFIG_LOCKDEP) &&
+	   (BIT(type) & (HK_FLAG_DOMAIN | HK_FLAG_KERNEL_NOISE | HK_FLAG_MANAGED_IRQ))) {
 		/* Cpuset isn't even writable yet? */
 		if (system_state <= SYSTEM_SCHEDULING)
 			return true;
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 18/23] cpu/hotplug: Add a new cpuhp_offline_cb() API
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
                   ` (16 preceding siblings ...)
  2026-04-21  3:03 ` [PATCH 17/23] sched/isolation: Extend housekeeping_dereference_check() to cover changes in nohz_full or managed_irq cpumasks Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-21 16:17   ` Thomas Gleixner
  2026-04-22  3:09   ` sashiko-bot
  2026-04-21  3:03 ` [PATCH 19/23] cgroup/cpuset: Improve check for calling housekeeping_update() Waiman Long
                   ` (4 subsequent siblings)
  22 siblings, 2 replies; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

Add a new cpuhp_offline_cb() API that offlines a set of CPUs one by
one, runs the given callback function, and then brings those CPUs back
online, all while inhibiting any concurrent CPU hotplug operations.

This new API can be used to enable runtime adjustment of the nohz_full
and isolcpus boot command line options. A new cpuhp_offline_cb_mode
flag is also added to signal that the system is in this transient
offline-callback state so that some hotplug operations can be optimized
out if we choose to.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/cpuhplock.h |  9 +++++
 kernel/cpu.c              | 70 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+)

diff --git a/include/linux/cpuhplock.h b/include/linux/cpuhplock.h
index 286b3ab92e15..37637baa32eb 100644
--- a/include/linux/cpuhplock.h
+++ b/include/linux/cpuhplock.h
@@ -9,7 +9,9 @@
 
 #include <linux/cleanup.h>
 #include <linux/errno.h>
+#include <linux/cpumask_types.h>
 
+typedef int (*cpuhp_cb_t)(void *arg);
 struct device;
 
 extern int lockdep_is_cpus_held(void);
@@ -29,6 +31,8 @@ void clear_tasks_mm_cpumask(int cpu);
 int remove_cpu(unsigned int cpu);
 int cpu_device_down(struct device *dev);
 void smp_shutdown_nonboot_cpus(unsigned int primary_cpu);
+int cpuhp_offline_cb(struct cpumask *mask, cpuhp_cb_t func, void *arg);
+extern bool cpuhp_offline_cb_mode;
 
 #else /* CONFIG_HOTPLUG_CPU */
 
@@ -43,6 +47,11 @@ static inline void cpu_hotplug_disable(void) { }
 static inline void cpu_hotplug_enable(void) { }
 static inline int remove_cpu(unsigned int cpu) { return -EPERM; }
 static inline void smp_shutdown_nonboot_cpus(unsigned int primary_cpu) { }
+static inline int cpuhp_offline_cb(struct cpumask *mask, cpuhp_cb_t func, void *arg)
+{
+	return -EPERM;
+}
+#define cpuhp_offline_cb_mode	false
 #endif	/* !CONFIG_HOTPLUG_CPU */
 
 DEFINE_LOCK_GUARD_0(cpus_read_lock, cpus_read_lock(), cpus_read_unlock())
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 0d02b5d7a7ba..9b32f742cd1d 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1520,6 +1520,76 @@ int remove_cpu(unsigned int cpu)
 }
 EXPORT_SYMBOL_GPL(remove_cpu);
 
+bool cpuhp_offline_cb_mode;
+
+/**
+ * cpuhp_offline_cb - offline CPUs, invoke callback function & online CPUs afterward
+ * @mask: A mask of CPUs to be taken offline and then online
+ * @func: A callback function to be invoked while the given CPUs are offline
+ * @arg:  Argument to be passed back to the callback function
+ *
+ * Return: 0 if successful, an error code otherwise
+ */
+int cpuhp_offline_cb(struct cpumask *mask, cpuhp_cb_t func, void *arg)
+{
+	int off_cpu, on_cpu, ret, ret2 = 0;
+
+	if (WARN_ON_ONCE(cpumask_empty(mask) ||
+	   !cpumask_subset(mask, cpu_online_mask)))
+		return -EINVAL;
+
+	pr_debug("%s: begin (CPU list = %*pbl)\n", __func__, cpumask_pr_args(mask));
+	lock_device_hotplug();
+	cpuhp_offline_cb_mode = true;
+	/*
+	 * If all offline operations succeed, off_cpu should become nr_cpu_ids.
+	 */
+	for_each_cpu(off_cpu, mask) {
+		ret = device_offline(get_cpu_device(off_cpu));
+		if (unlikely(ret))
+			break;
+	}
+	if (!ret)
+		ret = func(arg);
+
+	/* Bring previously offline CPUs back online */
+	for_each_cpu(on_cpu, mask) {
+		int retries = 0;
+
+		if (on_cpu == off_cpu)
+			break;
+
+retry:
+		ret2 = device_online(get_cpu_device(on_cpu));
+
+		/*
+		 * With the unlikely event that CPU hotplug is disabled while
+		 * this operation is in progress, we will need to wait a bit
+		 * for hotplug to hopefully be re-enabled again. If not, print
+		 * a warning and return the error.
+		 *
+		 * cpu_hotplug_disabled is supposed to be accessed while
+		 * holding the cpu_add_remove_lock mutex. So we need to
+		 * use the data_race() macro to access it here.
+		 */
+		while ((ret2 == -EBUSY) && data_race(cpu_hotplug_disabled) &&
+		       (++retries <= 5)) {
+			msleep(20);
+			if (!data_race(cpu_hotplug_disabled))
+				goto retry;
+		}
+		if (ret2) {
+			pr_warn("%s: Failed to bring CPU %d back online!\n",
+				__func__, on_cpu);
+			break;
+		}
+	}
+	cpuhp_offline_cb_mode = false;
+	unlock_device_hotplug();
+	pr_debug("%s: end\n", __func__);
+	return ret ? ret : ret2;
+}
+
 void smp_shutdown_nonboot_cpus(unsigned int primary_cpu)
 {
 	unsigned int cpu;
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 19/23] cgroup/cpuset: Improve check for calling housekeeping_update()
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
                   ` (17 preceding siblings ...)
  2026-04-21  3:03 ` [PATCH 18/23] cpu/hotplug: Add a new cpuhp_offline_cb() API Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-21  3:03 ` [PATCH 20/23] cgroup/cpuset: Enable runtime update of HK_TYPE_{KERNEL_NOISE,MANAGED_IRQ} cpumasks Waiman Long
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

By making sure that isolated_hk_cpus matches isolated_cpus at boot
time, we can more accurately determine whether calling
housekeeping_update() is needed by checking whether the two cpumasks
are equal. The update_housekeeping flag still has a use in
cpuset_handle_hotplug() to determine whether a work function should be
queued to invoke cpuset_update_sd_hk_unlock(), as that code is not
supposed to look at isolated_hk_cpus without holding cpuset_top_mutex.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a4eccb0ec0d1..1b0c50b46a49 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1339,26 +1339,29 @@ static void cpuset_update_sd_hk_unlock(void)
 	__releases(&cpuset_mutex)
 	__releases(&cpuset_top_mutex)
 {
+	update_housekeeping = false;
+
 	/* force_sd_rebuild will be cleared in rebuild_sched_domains_locked() */
 	if (force_sd_rebuild)
 		rebuild_sched_domains_locked();
 
-	if (update_housekeeping) {
-		update_housekeeping = false;
-		cpumask_copy(isolated_hk_cpus, isolated_cpus);
-
-		/*
-		 * housekeeping_update() is now called without holding
-		 * cpus_read_lock and cpuset_mutex. Only cpuset_top_mutex
-		 * is still being held for mutual exclusion.
-		 */
-		mutex_unlock(&cpuset_mutex);
-		cpus_read_unlock();
-		WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus, BIT(HK_TYPE_DOMAIN)));
-		mutex_unlock(&cpuset_top_mutex);
-	} else {
+	if (cpumask_equal(isolated_hk_cpus, isolated_cpus)) {
+		/* No housekeeping cpumask update needed */
 		cpuset_full_unlock();
+		return;
 	}
+
+	cpumask_copy(isolated_hk_cpus, isolated_cpus);
+
+	/*
+	 * housekeeping_update() is now called without holding
+	 * cpus_read_lock and cpuset_mutex. Only cpuset_top_mutex
+	 * is still being held for mutual exclusion.
+	 */
+	mutex_unlock(&cpuset_mutex);
+	cpus_read_unlock();
+	WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus, BIT(HK_TYPE_DOMAIN)));
+	mutex_unlock(&cpuset_top_mutex);
 }
 
 /*
@@ -3692,10 +3695,11 @@ int __init cpuset_init(void)
 
 	BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL));
 
-	if (housekeeping_enabled(HK_TYPE_DOMAIN_BOOT))
+	if (housekeeping_enabled(HK_TYPE_DOMAIN_BOOT)) {
 		cpumask_andnot(isolated_cpus, cpu_possible_mask,
 			       housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT));
-
+		cpumask_copy(isolated_hk_cpus, isolated_cpus);
+	}
 	return 0;
 }
 
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 20/23] cgroup/cpuset: Enable runtime update of HK_TYPE_{KERNEL_NOISE,MANAGED_IRQ} cpumasks
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
                   ` (18 preceding siblings ...)
  2026-04-21  3:03 ` [PATCH 19/23] cgroup/cpuset: Improve check for calling housekeeping_update() Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-22  3:09   ` sashiko-bot
  2026-04-21  3:03 ` [PATCH 21/23] cgroup/cpuset: Limit the side effect of using CPU hotplug on isolated partition Waiman Long
                   ` (2 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

One simple way to enable runtime update of the HK_TYPE_KERNEL_NOISE
(nohz_full) and HK_TYPE_MANAGED_IRQ cpumasks is to use CPU hotplug to
transition the CPUs that are changing state, as long as
CONFIG_HOTPLUG_CPU is enabled and a nohz_full boot parameter is
provided. Otherwise, only the HK_TYPE_DOMAIN cpumask will be updated
at run time.

Changes in the HK_TYPE_DOMAIN cpumask can be made without using CPU
hotplug. For changes in the HK_TYPE_MANAGED_IRQ cpumask, we have to
update the cpumask first and then tear down and bring up the newly
isolated CPUs to migrate the managed irqs on those CPUs to other
housekeeping CPUs.

For changes in HK_TYPE_KERNEL_NOISE, we have to tear down all the newly
isolated and de-isolated CPUs, change the cpumask and then bring all the
offline CPUs back online.

As the various boot-time versions of the housekeeping cpumasks may
differ, resulting in different sets of isolated cpumasks being passed
to housekeeping_update(), we may need to pre-allocate these cpumasks.

Note that using CPU hotplug to change the HK_TYPE_KERNEL_NOISE and
HK_TYPE_MANAGED_IRQ housekeeping cpumasks has a drawback: during the
teardown of a CPU from the CPUHP_TEARDOWN_CPU state down to
CPUHP_AP_OFFLINE, the stop_machine code will be invoked to stop all
the other CPUs, including all the pre-existing isolated CPUs. That
will cause latency spikes on those isolated CPUs. Such a spike should
only happen when the cpuset isolated partition setting is changed,
resulting in changes to those housekeeping cpumasks.

One possible workaround currently in use is to pre-allocate a set of
nohz_full and managed_irq CPUs at boot time. These semi-isolated CPUs
are then used to create cpuset isolated partitions when full isolation
is needed. Workloads that cannot tolerate these latency spikes will
likely keep using this approach even after the nohz_full and
managed_irq CPUs become runtime changeable.

This is a problem we need to address in a future patch series.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 199 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 182 insertions(+), 17 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 1b0c50b46a49..a927b9cd4f71 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -152,9 +152,17 @@ static cpumask_var_t	isolated_cpus;		/* CSCB */
 static bool		update_housekeeping;	/* RWCS */
 
 /*
- * Copy of isolated_cpus to be passed to housekeeping_update()
+ * Cpumasks to be passed to housekeeping_update()
+ * isolated_hk_cpus - copy of isolated_cpus for HK_TYPE_DOMAIN
+ * isolated_nohz_cpus - for HK_TYPE_KERNEL_NOISE
+ * isolated_mirq_cpus - for HK_TYPE_MANAGED_IRQ
  */
 static cpumask_var_t	isolated_hk_cpus;	/* T */
+static cpumask_var_t	isolated_nohz_cpus;	/* T */
+static cpumask_var_t	isolated_mirq_cpus;	/* T */
+
+static bool		boot_nohz_le_domain __ro_after_init;
+static bool		boot_mirq_le_domain __ro_after_init;
 
 /*
  * A flag to force sched domain rebuild at the end of an operation.
@@ -1328,29 +1336,67 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
 	return false;
 }
 
+static int cpuset_nohz_update_cbfunc(void *arg)
+{
+	struct cpumask *isol_cpus = (struct cpumask *)arg;
+
+	if (isol_cpus)
+		housekeeping_update(isol_cpus, BIT(HK_TYPE_KERNEL_NOISE));
+	return 0;
+}
+
 /*
- * cpuset_update_sd_hk_unlock - Rebuild sched domains, update HK & unlock
- *
- * Update housekeeping cpumasks and rebuild sched domains if necessary and
- * then do a cpuset_full_unlock().
- * This should be called at the end of cpuset operation.
  */
-static void cpuset_update_sd_hk_unlock(void)
-	__releases(&cpuset_mutex)
-	__releases(&cpuset_top_mutex)
+static void cpuset_update_housekeeping_unlock(void)
 {
-	update_housekeeping = false;
+	bool update_nohz, update_mirq;
+	cpumask_var_t cpus;
+	int ret;
 
-	/* force_sd_rebuild will be cleared in rebuild_sched_domains_locked() */
-	if (force_sd_rebuild)
-		rebuild_sched_domains_locked();
+	if (!tick_nohz_full_enabled())
+		return;
 
-	if (cpumask_equal(isolated_hk_cpus, isolated_cpus)) {
-		/* No housekeeping cpumask update needed */
+	update_nohz = boot_nohz_le_domain;
+	update_mirq = boot_mirq_le_domain;
+
+	if (WARN_ON_ONCE(!alloc_cpumask_var(&cpus, GFP_KERNEL))) {
 		cpuset_full_unlock();
 		return;
 	}
 
+	/*
+	 * Update isolated_nohz_cpus/isolated_mirq_cpus if necessary
+	 */
+	if (!boot_nohz_le_domain) {
+		cpumask_andnot(cpus, cpu_possible_mask,
+			       housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
+		cpumask_or(cpus, cpus, isolated_cpus);
+		update_nohz = !cpumask_equal(isolated_nohz_cpus, cpus);
+		if (update_nohz)
+			cpumask_copy(isolated_nohz_cpus, cpus);
+	}
+	if (!boot_mirq_le_domain) {
+		cpumask_andnot(cpus, cpu_possible_mask,
+			       housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));
+		cpumask_or(cpus, cpus, isolated_cpus);
+		update_mirq = !cpumask_equal(isolated_mirq_cpus, cpus);
+		if (update_mirq)
+			cpumask_copy(isolated_mirq_cpus, cpus);
+	}
+
+	/*
+	 * Compute list of CPUs to be brought offline into "cpus"
+	 * isolated_hk_cpus - old cpumask
+	 * isolated_cpus    - new cpumask
+	 *
+	 * With update_nohz, we need to offline both the newly isolated
+	 * and de-isolated CPUs. With only update_mirq, we only need to
+	 * offline the new isolated CPUs.
+	 */
+	if (update_nohz)
+		cpumask_xor(cpus, isolated_hk_cpus, isolated_cpus);
+	else if (update_mirq)
+		cpumask_andnot(cpus, isolated_cpus, isolated_hk_cpus);
 	cpumask_copy(isolated_hk_cpus, isolated_cpus);
 
 	/*
@@ -1360,10 +1406,103 @@ static void cpuset_update_sd_hk_unlock(void)
 	 */
 	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
-	WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus, BIT(HK_TYPE_DOMAIN)));
+
+	if (!update_mirq) {
+		ret = housekeeping_update(isolated_hk_cpus, BIT(HK_TYPE_DOMAIN));
+	} else if (boot_mirq_le_domain) {
+		ret = housekeeping_update(isolated_hk_cpus,
+					  BIT(HK_TYPE_DOMAIN)|BIT(HK_TYPE_MANAGED_IRQ));
+	} else {
+		ret = housekeeping_update(isolated_hk_cpus, BIT(HK_TYPE_DOMAIN));
+		if (!ret)
+			ret = housekeeping_update(isolated_mirq_cpus,
+						  BIT(HK_TYPE_MANAGED_IRQ));
+	}
+
+	if (WARN_ON_ONCE(ret))
+		goto out_free;
+
+	/*
+	 * Calling cpuhp_offline_cb() is only needed if either
+	 * HK_TYPE_KERNEL_NOISE and/or HK_TYPE_MANAGED_IRQ cpumasks
+	 * needed to be updated.
+	 *
+	 * TODO: When tearing down a CPU from CPUHP_TEARDOWN_CPU state
+	 * downward to CPUHP_AP_OFFLINE, the stop_machine code will be
+	 * invoked to stop all the other CPUs in the system. This will
+	 * cause latency spikes on existing isolated CPUs. We will need
+	 * some new mechanism to enable us to not stop the existing
+	 * isolated CPUs whenever possible. A possible workaround is
+	 * to pre-allocate a set of nohz_full and managed_irq CPUs at
+	 * boot time and use them to create isolated cpuset partitions
+	 * so that CPU hotplug won't need to be used.
+	 */
+	if (update_mirq || update_nohz) {
+		struct cpumask *nohz_cpus;
+
+		/*
+		 * Calling housekeeping_update() is only needed if
+		 * update_nohz is set.
+		 */
+		nohz_cpus = !update_nohz ? NULL : boot_nohz_le_domain
+					 ? isolated_hk_cpus
+					 : isolated_nohz_cpus;
+		/*
+		 * Mask out offline CPUs in cpus. If there are no online
+		 * CPUs left, we can call housekeeping_update() directly
+		 * if needed.
+		 */
+		cpumask_and(cpus, cpus, cpu_online_mask);
+		if (!cpumask_empty(cpus))
+			ret = cpuhp_offline_cb(cpus, cpuset_nohz_update_cbfunc,
+					       nohz_cpus);
+		else if (nohz_cpus)
+			ret = housekeeping_update(nohz_cpus, BIT(HK_TYPE_KERNEL_NOISE));
+	}
+	WARN_ON_ONCE(ret);
+out_free:
+	free_cpumask_var(cpus);
 	mutex_unlock(&cpuset_top_mutex);
 }
 
+/*
+ * cpuset_update_sd_hk_unlock - Rebuild sched domains, update HK & unlock
+ *
+ * Update housekeeping cpumasks and rebuild sched domains if necessary and
+ * then do a cpuset_full_unlock().
+ * This should be called at the end of cpuset operation.
+ */
+static void cpuset_update_sd_hk_unlock(void)
+	__releases(&cpuset_mutex)
+	__releases(&cpuset_top_mutex)
+{
+	/* force_sd_rebuild will be cleared in rebuild_sched_domains_locked() */
+	if (force_sd_rebuild)
+		rebuild_sched_domains_locked();
+
+	update_housekeeping = false;
+
+	if (cpumask_equal(isolated_cpus, isolated_hk_cpus)) {
+		cpuset_full_unlock();
+		return;
+	}
+
+	if (!tick_nohz_full_enabled()) {
+		/*
+		 * housekeeping_update() is now called without holding
+		 * cpus_read_lock and cpuset_mutex. Only cpuset_top_mutex
+		 * is still being held for mutual exclusion.
+		 */
+		mutex_unlock(&cpuset_mutex);
+		cpus_read_unlock();
+		WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus,
+						 BIT(HK_TYPE_DOMAIN)));
+		mutex_unlock(&cpuset_top_mutex);
+	} else {
+		cpuset_update_housekeeping_unlock();
+	}
+}
+
 /*
  * Work function to invoke cpuset_update_sd_hk_unlock()
  */
@@ -3700,6 +3839,29 @@ int __init cpuset_init(void)
 			       housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT));
 		cpumask_copy(isolated_hk_cpus, isolated_cpus);
 	}
+
+	/*
+	 * If the nohz_full and/or managed_irq cpu list, when present, is a
+	 * subset of the domain cpu list, i.e. the HK_TYPE_DOMAIN_BOOT
+	 * cpumask is a subset of the HK_TYPE_KERNEL_NOISE_BOOT/
+	 * HK_TYPE_MANAGED_IRQ_BOOT cpumask, the corresponding non-boot
+	 * housekeeping cpumasks will follow changes in the HK_TYPE_DOMAIN
+	 * cpumask. So we don't need separate cpumasks to isolate them.
+	 */
+	boot_nohz_le_domain = cpumask_subset(housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT),
+					     housekeeping_cpumask(HK_TYPE_KERNEL_NOISE_BOOT));
+	boot_mirq_le_domain = cpumask_subset(housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT),
+					     housekeeping_cpumask(HK_TYPE_MANAGED_IRQ_BOOT));
+	if (!boot_nohz_le_domain) {
+		BUG_ON(!alloc_cpumask_var(&isolated_nohz_cpus, GFP_KERNEL));
+		cpumask_andnot(isolated_nohz_cpus, cpu_possible_mask,
+			       housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
+	}
+	if (!boot_mirq_le_domain) {
+		BUG_ON(!alloc_cpumask_var(&isolated_mirq_cpus, GFP_KERNEL));
+		cpumask_andnot(isolated_mirq_cpus, cpu_possible_mask,
+			       housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));
+	}
 	return 0;
 }
 
@@ -3954,7 +4116,10 @@ static void cpuset_handle_hotplug(void)
 	 */
 	if (force_sd_rebuild)
 		rebuild_sched_domains_cpuslocked();
-	if (update_housekeeping)
+	/*
+	 * Don't need to update housekeeping cpumasks in cpuhp_offline_cb mode.
+	 */
+	if (update_housekeeping && !cpuhp_offline_cb_mode)
 		queue_work(system_dfl_wq, &hk_sd_work);
 
 	free_tmpmasks(ptmp);
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 21/23] cgroup/cpuset: Limit the side effect of using CPU hotplug on isolated partition
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
                   ` (19 preceding siblings ...)
  2026-04-21  3:03 ` [PATCH 20/23] cgroup/cpuset: Enable runtime update of HK_TYPE_{KERNEL_NOISE,MANAGED_IRQ} cpumasks Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-22  3:09   ` sashiko-bot
  2026-04-21  3:03 ` [PATCH 22/23] cgroup/cpuset: Prevent offline_disabled CPUs from being used in " Waiman Long
  2026-04-21  3:03 ` [PATCH 23/23] cgroup/cpuset: Documentation and kselftest updates Waiman Long
  22 siblings, 1 reply; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)

CPU hotplug is used to facilitate the modification of the
HK_TYPE_KERNEL_NOISE and HK_TYPE_MANAGED_IRQ cpumasks. However, tearing
down and bringing up CPUs can impact the cpuset partition states
as well. For instance, tearing down the last CPU of a partition with
active tasks can invalidate the partition, which would not happen if
CPU hotplug weren't used.

A workaround for this issue is to disable the invalidation by pretending
that the partition has no tasks, and to move the tasks within the
partition to the parent's effective CPUs for the short transition
period during which the CPUs are torn down and then brought up again.
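The two pieces of the workaround can be modeled in plain C with
unsigned long bitmasks standing in for struct cpumask (the names below
are illustrative sketches, not the kernel's actual helpers):

```c
#include <assert.h>

/* Illustrative model of cpuhp_offline_cb_mode, not kernel code. */
static int cpuhp_offline_cb_mode;

/*
 * While CPUs are being cycled through hotplug, pretend that every
 * partition is empty so that losing its last CPU does not invalidate it.
 */
static int partition_is_populated_model(int nr_tasks)
{
	if (cpuhp_offline_cb_mode)
		return 0;
	return nr_tasks > 0;
}

/*
 * If an isolated partition temporarily has no online CPUs, fall back
 * to the parent's effective CPUs until the CPUs come back online.
 */
static unsigned long effective_cpus_model(unsigned long new_cpus,
					  unsigned long parent_effective,
					  int is_isolated)
{
	if (cpuhp_offline_cb_mode && !new_cpus && is_isolated)
		return parent_effective;
	return new_cpus;
}
```

Once the affected CPUs are brought back online, the hotplug callbacks
recompute the real effective CPUs, so the fallback is transient.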

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a927b9cd4f71..5f6b4e67748f 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -434,6 +434,13 @@ static inline bool partition_is_populated(struct cpuset *cs,
 	struct cpuset *cp;
 	struct cgroup_subsys_state *pos_css;
 
+	/*
+	 * Hack: In cpuhp_offline_cb_mode, pretend all partitions are empty
+	 * to prevent unnecessary partition invalidation.
+	 */
+	if (cpuhp_offline_cb_mode)
+		return false;
+
 	/*
 	 * We cannot call cs_is_populated(cs) directly, as
 	 * nr_populated_domain_children may include populated
@@ -3881,6 +3888,17 @@ hotplug_update_tasks(struct cpuset *cs,
 	cs->effective_mems = *new_mems;
 	spin_unlock_irq(&callback_lock);
 
+	/*
+	 * When cpuhp_offline_cb_mode is active, a valid isolated partition
+	 * with tasks may have no online CPUs available for a short while.
+	 * In that case, we fall back to the parent's effective CPUs
+	 * temporarily; they will be reset back to their rightful value
+	 * once the affected CPUs are online again.
+	 */
+	if (cpuhp_offline_cb_mode && cpumask_empty(new_cpus) &&
+	   (cs->partition_root_state == PRS_ISOLATED))
+		cpumask_copy(new_cpus, parent_cs(cs)->effective_cpus);
+
 	if (cpus_updated)
 		cpuset_update_tasks_cpumask(cs, new_cpus);
 	if (mems_updated)
-- 
2.53.0



* [PATCH 22/23] cgroup/cpuset: Prevent offline_disabled CPUs from being used in isolated partition
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
                   ` (20 preceding siblings ...)
  2026-04-21  3:03 ` [PATCH 21/23] cgroup/cpuset: Limit the side effect of using CPU hotplug on isolated partition Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-22  3:09   ` sashiko-bot
  2026-04-21  3:03 ` [PATCH 23/23] cgroup/cpuset: Documentation and kselftest updates Waiman Long
  22 siblings, 1 reply; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)

If tick_nohz_full_enabled() is true, we are going to use CPU
hotplug to enable runtime changes to the HK_TYPE_KERNEL_NOISE and
HK_TYPE_MANAGED_IRQ cpumasks. However, on some architectures, one
or more CPUs will have the offline_disabled flag set in their cpu
devices. For instance, x86-64 sets this flag for its boot CPU
(typically CPU 0) to prevent it from going offline. Those CPUs can't
be used in a cpuset isolated partition, or we are going to have
problems in the CPU offline process.

Find out the set of CPUs with offline_disabled set in a new
cpuset_init_late() helper, set the corresponding bits in the
offline_disabled_cpus cpumask, and check it when isolated partitions
are being created or modified to ensure that none of those
offline-disabled CPUs will be used in an isolated partition.

An error message mentioning those offline-disabled CPUs will be
constructed in cpuset_init_late() and shown in "cpuset.cpus.partition"
when isolated partition creation or modification fails.
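The gist of the new check can be sketched in plain C with bitmasks in
place of cpumasks (the names here are illustrative, not the patch's
exact code):

```c
#include <assert.h>

#define PERR_OL_DISABLED 1	/* illustrative error code */

/*
 * Reject an isolated partition whose requested CPUs intersect the set
 * of CPUs that cannot be taken offline.
 */
static int check_isolated_cpus_model(unsigned long new_cpus,
				     unsigned long offline_disabled_cpus)
{
	if (new_cpus & offline_disabled_cpus)
		return PERR_OL_DISABLED;
	return 0;
}
```

The real patch applies this test at each of the partition creation and
modification paths, as the diff below shows.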

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset-internal.h |  1 +
 kernel/cgroup/cpuset.c          | 89 ++++++++++++++++++++++++++++++++-
 2 files changed, 88 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index fd7d19842ded..87b7411540ff 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -35,6 +35,7 @@ enum prs_errcode {
 	PERR_HKEEPING,
 	PERR_ACCESS,
 	PERR_REMOTE,
+	PERR_OL_DISABLED,
 };
 
 /* bits in struct cpuset flags field */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 5f6b4e67748f..f3af8ef6c64e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -20,6 +20,8 @@
  */
 #include "cpuset-internal.h"
 
+#include <linux/cpu.h>
+#include <linux/device.h>
 #include <linux/init.h>
 #include <linux/interrupt.h>
 #include <linux/kernel.h>
@@ -48,7 +50,7 @@ DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
  */
 DEFINE_STATIC_KEY_FALSE(cpusets_insane_config_key);
 
-static const char * const perr_strings[] = {
+static const char *perr_strings[] __ro_after_init = {
 	[PERR_INVCPUS]   = "Invalid cpu list in cpuset.cpus.exclusive",
 	[PERR_INVPARENT] = "Parent is an invalid partition root",
 	[PERR_NOTPART]   = "Parent is not a partition root",
@@ -59,6 +61,7 @@ static const char * const perr_strings[] = {
 	[PERR_HKEEPING]  = "partition config conflicts with housekeeping setup",
 	[PERR_ACCESS]    = "Enable partition not permitted",
 	[PERR_REMOTE]    = "Have remote partition underneath",
+	[PERR_OL_DISABLED] = "", /* To be set up later */
 };
 
 /*
@@ -164,6 +167,12 @@ static cpumask_var_t	isolated_mirq_cpus;	/* T */
 static bool		boot_nohz_le_domain __ro_after_init;
 static bool		boot_mirq_le_domain __ro_after_init;
 
+/*
+ * Cpumask of CPUs with offline_disabled set
+ * The cpumask is effectively __ro_after_init.
+ */
+static cpumask_var_t	offline_disabled_cpus;
+
 /*
  * A flag to force sched domain rebuild at the end of an operation.
  * It can be set in
@@ -1649,6 +1658,8 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
 	 * The effective_xcpus mask can contain offline CPUs, but there must
 	 * be at least one or more online CPUs present before it can be enabled.
 	 *
+	 * An isolated partition cannot contain CPUs with offline disabled.
+	 *
 	 * Note that creating a remote partition with any local partition root
 	 * above it or remote partition root underneath it is not allowed.
 	 */
@@ -1661,6 +1672,9 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
 	     !isolated_cpus_can_update(tmp->new_cpus, NULL)) ||
 	    prstate_housekeeping_conflict(new_prs, tmp->new_cpus))
 		return PERR_HKEEPING;
+	if (tick_nohz_full_enabled() && (new_prs == PRS_ISOLATED) &&
+	    cpumask_intersects(tmp->new_cpus, offline_disabled_cpus))
+		return PERR_OL_DISABLED;
 
 	spin_lock_irq(&callback_lock);
 	partition_xcpus_add(new_prs, NULL, tmp->new_cpus);
@@ -1746,6 +1760,16 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
 		goto invalidate;
 	}
 
+	/*
+	 * An isolated partition cannot contain CPUs with the
+	 * offline_disabled bit set.
+	 */
+	if (tick_nohz_full_enabled() && (prs == PRS_ISOLATED) &&
+	    cpumask_intersects(excpus, offline_disabled_cpus)) {
+		cs->prs_err = PERR_OL_DISABLED;
+		goto invalidate;
+	}
+
 	adding   = cpumask_andnot(tmp->addmask, excpus, cs->effective_xcpus);
 	deleting = cpumask_andnot(tmp->delmask, cs->effective_xcpus, excpus);
 
@@ -1913,6 +1937,11 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 		if (tasks_nocpu_error(parent, cs, xcpus))
 			return PERR_NOCPUS;
 
+		if (tick_nohz_full_enabled() &&
+		    (new_prs == PRS_ISOLATED) &&
+		    cpumask_intersects(xcpus, offline_disabled_cpus))
+			return PERR_OL_DISABLED;
+
 		/*
 		 * This function will only be called when all the preliminary
 		 * checks have passed. At this point, the following condition
@@ -1979,12 +2008,21 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 					       parent->effective_xcpus);
 		}
 
+		/*
+		 * An isolated partition cannot contain CPUs with the
+		 * offline_disabled bit set.
+		 */
+		if (tick_nohz_full_enabled() &&
+		   ((old_prs == PRS_ISOLATED) ||
+		    (old_prs == PRS_INVALID_ISOLATED)) &&
+		    cpumask_intersects(newmask, offline_disabled_cpus)) {
+			part_error = PERR_OL_DISABLED;
 		/*
 		 * TBD: Invalidate a currently valid child root partition may
 		 * still break isolated_cpus_can_update() rule if parent is an
 		 * isolated partition.
 		 */
-		if (is_partition_valid(cs) && (old_prs != parent_prs)) {
+		} else if (is_partition_valid(cs) && (old_prs != parent_prs)) {
 			if ((parent_prs == PRS_ROOT) &&
 			    /* Adding to parent means removing isolated CPUs */
 			    !isolated_cpus_can_update(tmp->delmask, tmp->addmask))
@@ -1995,6 +2033,19 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 				part_error = PERR_HKEEPING;
 		}
 
+		if (part_error) {
+			deleting = false;
+			/*
+			 * For a previously valid partition, we need to move
+			 * the exclusive CPUs back to its parent.
+			 */
+			if (is_partition_valid(cs) &&
+			    !cpumask_empty(cs->effective_xcpus)) {
+				cpumask_copy(tmp->addmask, cs->effective_xcpus);
+				adding = true;
+			}
+		}
+
 		/*
 		 * The new CPUs to be removed from parent's effective CPUs
 		 * must be present.
@@ -3829,6 +3880,7 @@ int __init cpuset_init(void)
 	BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&isolated_hk_cpus, GFP_KERNEL));
+	BUG_ON(!zalloc_cpumask_var(&offline_disabled_cpus, GFP_KERNEL));
 
 	cpumask_setall(top_cpuset.cpus_allowed);
 	nodes_setall(top_cpuset.mems_allowed);
@@ -4188,6 +4240,39 @@ void __init cpuset_init_smp(void)
 	BUG_ON(!cpuset_migrate_mm_wq);
 }
 
+/**
+ * cpuset_init_late - initialize the list of CPUs with offline_disabled set
+ *
+ * Description: Initialize a cpumask with CPUs that have the offline_disabled
+ *		bit set. It is done in a separate initcall as cpuset_init_smp()
+ *		is called before driver_init() where the CPU devices will be
+ *		set up.
+ */
+static int __init cpuset_init_late(void)
+{
+	int cpu;
+
+	if (!tick_nohz_full_enabled())
+		return 0;
+	/*
+	 * Iterate over all the possible CPUs to see which ones have offline disabled.
+	 */
+	for_each_possible_cpu(cpu) {
+		if (get_cpu_device(cpu)->offline_disabled)
+			__cpumask_set_cpu(cpu, offline_disabled_cpus);
+	}
+	if (!cpumask_empty(offline_disabled_cpus)) {
+		char buf[128];
+
+		snprintf(buf, sizeof(buf),
+			 "CPU %*pbl with offline disabled not allowed in isolated partition",
+			 cpumask_pr_args(offline_disabled_cpus));
+		perr_strings[PERR_OL_DISABLED] = kstrdup(buf, GFP_KERNEL);
+	}
+	return 0;
+}
+pure_initcall(cpuset_init_late);
+
 /*
  * Return cpus_allowed mask from a task's cpuset.
  */
-- 
2.53.0



* [PATCH 23/23] cgroup/cpuset: Documentation and kselftest updates
  2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
                   ` (21 preceding siblings ...)
  2026-04-21  3:03 ` [PATCH 22/23] cgroup/cpuset: Prevent offline_disabled CPUs from being used in " Waiman Long
@ 2026-04-21  3:03 ` Waiman Long
  2026-04-22  3:09   ` sashiko-bot
  22 siblings, 1 reply; 56+ messages in thread
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)

As CPU hotplug is now being used to enable runtime updates to the list
of nohz_full and managed_irq CPUs, we should avoid using CPU 0 in the
formation of isolated partitions, as CPU 0 may not be able to be
brought offline, as is the case on the x86-64 architecture. So a number
of the test cases in test_cpuset_prs.sh have to be updated accordingly.

A new test will also be run, when offlining CPU 0 isn't allowed, to
verify that using CPU 0 as part of an isolated partition will fail.

The cgroup-v2.rst is also updated to reflect the new capability of using
CPU hotplug to enable runtime changes to the nohz_full and managed_irq
CPU lists.

Since there is a slight performance overhead to enabling runtime changes
to the nohz_full CPU list, users have to explicitly opt in by adding a
"nohz_full" kernel command line parameter with or without a CPU list.
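The opt-in detection can be sketched in shell; `has_bare_nohz_full` is
a hypothetical helper that mirrors what the updated selftest does when
it scans the kernel command line:

```shell
# Print "yes" if the given kernel command line contains a bare
# "nohz_full" word (no "=<cpu list>"), "no" otherwise.
has_bare_nohz_full() {
	for w in $1; do
		[ "$w" = nohz_full ] && { echo yes; return 0; }
	done
	echo no
}
```

In practice it would be invoked as
`has_bare_nohz_full "$(cat /proc/cmdline)"`.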

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst       | 35 +++++++---
 .../selftests/cgroup/test_cpuset_prs.sh       | 70 +++++++++++++++++--
 2 files changed, 92 insertions(+), 13 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8ad0b2781317..e97fc031eb86 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2604,11 +2604,12 @@ Cpuset Interface Files
 
 	It accepts only the following input values when written to.
 
-	  ==========	=====================================
+	  ==========	===============================================
 	  "member"	Non-root member of a partition
 	  "root"	Partition root
-	  "isolated"	Partition root without load balancing
-	  ==========	=====================================
+	  "isolated"	Partition root without load balancing and other
+		        OS noise
+	  ==========	===============================================
 
 	A cpuset partition is a collection of cpuset-enabled cgroups with
 	a partition root at the top of the hierarchy and its descendants
@@ -2652,11 +2653,29 @@ Cpuset Interface Files
 	partition or scheduling domain.  The set of exclusive CPUs is
 	determined by the value of its "cpuset.cpus.exclusive.effective".
 
-	When set to "isolated", the CPUs in that partition will be in
-	an isolated state without any load balancing from the scheduler
-	and excluded from the unbound workqueues.  Tasks placed in such
-	a partition with multiple CPUs should be carefully distributed
-	and bound to each of the individual CPUs for optimal performance.
+	When set to "isolated", the CPUs in that partition will be in an
+	isolated state without any load balancing from the scheduler and
+	excluded from the unbound workqueues as well as other OS noise.
+	Tasks placed in such a partition with multiple CPUs should be
+	carefully distributed and bound to each of the individual CPUs
+	for optimal performance.
+
+	CPU hotplug, if supported, is used to improve the degree of CPU
+	isolation to close to that of the "nohz_full" kernel boot
+	parameter.  In some architectures, like x86-64, the boot CPU
+	(typically CPU 0) cannot be brought offline, so the boot CPU
+	should not be used for forming isolated partitions.  The
+	"nohz_full" kernel boot parameter needs to be present to enable
+	full dynticks support and RCU no-callback CPU mode for CPUs in
+	isolated partitions even if the optional cpu list isn't provided.
+
+	Using CPU hotplug for creating or destroying an isolated
+	partition can cause latency spikes in applications running
+	in other isolated partitions.  A reserved list of CPUs can
+	optionally be put in the "nohz_full" kernel boot parameter to
+	alleviate this problem.  When these reserved CPUs are used for
+	isolated partitions, CPU hotplug won't need to be invoked and
+	so there won't be latency spikes in other isolated partitions.
 
 	A partition root ("root" or "isolated") can be in one of the
 	two possible states - valid or invalid.  An invalid partition
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index a56f4153c64d..eebb4122b581 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -67,6 +67,12 @@ then
 	echo Y > /sys/kernel/debug/sched/verbose
 fi
 
+# Enable dynamic debug message if available
+DYN_DEBUG=/proc/dynamic_debug/control
+[[ -f $DYN_DEBUG ]] && {
+	echo "file kernel/cpu.c +p" > $DYN_DEBUG
+}
+
 cd $CGROUP2
 echo +cpuset > cgroup.subtree_control
 
@@ -84,6 +90,15 @@ echo member > test/cpuset.cpus.partition
 echo "" > test/cpuset.cpus
 [[ $RESULT -eq 0 ]] && skip_test "Child cgroups are using cpuset!"
 
+#
+# If nohz_full parameter is specified and nohz_full file exists, CPU hotplug
+# will be used to modify nohz_full cpumask to include all the isolated CPUs
+# in cpuset isolated partitions.
+#
+NOHZ_FULL=/sys/devices/system/cpu/nohz_full
+BOOT_NOHZ_FULL=$(fmt -1 /proc/cmdline | grep "^nohz_full")
+[[ "$BOOT_NOHZ_FULL" = nohz_full ]] && CHK_NOHZ_FULL=1
+
 #
 # If isolated CPUs have been reserved at boot time (as shown in
 # cpuset.cpus.isolated), these isolated CPUs should be outside of CPUs 0-8
@@ -318,8 +333,8 @@ TEST_MATRIX=(
 	# Invalid to valid local partition direct transition tests
 	" C1-3:P2  X4:P2    .      .      .      .      .      .     0 A1:1-3|XA1:1-3|A2:1-3:XA2: A1:P2|A2:P-2 1-3"
 	" C1-3:P2  X4:P2    .      .      .    X3:P2    .      .     0 A1:1-2|XA1:1-3|A2:3:XA2:3 A1:P2|A2:P2 1-3"
-	" C0-3:P2    .      .    C4-6   C0-4     .      .      .     0 A1:0-4|B1:5-6 A1:P2|B1:P0"
-	" C0-3:P2    .      .    C4-6 C0-4:C0-3  .      .      .     0 A1:0-3|B1:4-6 A1:P2|B1:P0 0-3"
+	" C1-3:P2    .      .    C4-6   C1-4     .      .      .     0 A1:1-4|B1:5-6 A1:P2|B1:P0"
+	" C1-3:P2    .      .    C4-6 C1-4:C1-3  .      .      .     0 A1:1-3|B1:4-6 A1:P2|B1:P0 1-3"
 
 	# Local partition invalidation tests
 	" C0-3:X1-3:P2 C1-3:X2-3:P2 C2-3:X3:P2 \
@@ -329,8 +344,8 @@ TEST_MATRIX=(
 	" C0-3:X1-3:P2 C1-3:X2-3:P2 C2-3:X3:P2 \
 				   .      .    C4:X     .      .     0 A1:1-3|A2:1-3|A3:2-3|XA2:|XA3: A1:P2|A2:P-2|A3:P-2 1-3"
 	# Local partition CPU change tests
-	" C0-5:P2  C4-5:P1  .      .      .    C3-5     .      .     0 A1:0-2|A2:3-5 A1:P2|A2:P1 0-2"
-	" C0-5:P2  C4-5:P1  .      .    C1-5     .      .      .     0 A1:1-3|A2:4-5 A1:P2|A2:P1 1-3"
+	" C1-5:P2  C4-5:P1  .      .      .    C3-5     .      .     0 A1:1-2|A2:3-5 A1:P2|A2:P1 1-2"
+	" C1-5:P2  C4-5:P1  .      .    C2-5     .      .      .     0 A1:2-3|A2:4-5 A1:P2|A2:P1 2-3"
 
 	# cpus_allowed/exclusive_cpus update tests
 	" C0-3:X2-3 C1-3:X2-3 C2-3:X2-3 \
@@ -442,6 +457,21 @@ TEST_MATRIX=(
 	"   C0-3     .      .    C4-5   X3-5     .      .      .     1 A1:0-3|B1:4-5"
 )
 
+#
+# Test matrix to verify that using CPU 0 in an isolated (local or remote)
+# partition will fail when offline isn't allowed for CPU 0.
+#
+CPU0_ISOLCPUS_MATRIX=(
+	#  old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS
+	#  ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ --------
+	"   C0-3     .      .    C4-5     P2     .      .      .     0 A1:0-3|B1:4-5 A1:P-2"
+	"   C1-3     .      .      .      P2     .      .      .     0 A1:1-3 A1:P2"
+	"   C1-3     .      .      .    P2:C0-3  .      .      .     0 A1:0-3 A1:P-2"
+	"  CX0-3   C0-3     .      .       .     P2     .      .     0 A1:0-3|A2:0-3 A2:P-2"
+	"  CX0-3 C0-3:X1-3  .      .       .     P2     .      .     0 A1:0|A2:1-3 A2:P2"
+	"  CX0-3 C0-3:X1-3  .      .       .   P2:X0-3  .      .     0 A1:0-3|A2:0-3 A2:P-2"
+)
+
 #
 # Cpuset controller remote partition test matrix.
 #
@@ -513,7 +543,7 @@ write_cpu_online()
 		}
 	fi
 	echo $VAL > $CPUFILE
-	pause 0.05
+	pause 0.10
 }
 
 #
@@ -654,6 +684,8 @@ dump_states()
 		[[ -e $PCPUS  ]] && echo "$PCPUS: $(cat $PCPUS)"
 		[[ -e $ISCPUS ]] && echo "$ISCPUS: $(cat $ISCPUS)"
 	done
+	# Dump nohz_full
+	[[ -f $NOHZ_FULL ]] && echo "nohz_full: $(cat $NOHZ_FULL)"
 }
 
 #
@@ -789,6 +821,18 @@ check_isolcpus()
 		EXPECTED_SDOMAIN=$EXPECTED_ISOLCPUS
 	fi
 
+	#
+	# Check if nohz_full matches cpuset.cpus.isolated when the nohz_full
+	# boot parameter is specified with no CPU list.
+	#
+	[[ -f $NOHZ_FULL && "$BOOT_NOHZ_FULL" = nohz_full ]] && {
+		NOHZ_FULL_CPUS=$(cat $NOHZ_FULL)
+		[[ "$ISOLCPUS" != "$NOHZ_FULL_CPUS" ]] && {
+			echo "nohz_full ($NOHZ_FULL_CPUS) does not match cpuset.cpus.isolated ($ISOLCPUS)"
+			return 1
+		}
+	}
+
 	#
 	# Appending pre-isolated CPUs
 	# Even though CPU #8 isn't used for testing, it can't be pre-isolated
@@ -1070,6 +1114,21 @@ run_remote_state_test()
 	echo "All $I tests of $TEST PASSED."
 }
 
+#
+# Testing CPU 0 isolated partition test when offline is disabled
+#
+run_cpu0_isol_test()
+{
+	# Skip the test if CPU0 offline is allowed or if nohz_full kernel
+	# boot parameter is missing.
+	CPU0_ONLINE=/sys/devices/system/cpu/cpu0/online
+	[[ -f $CPU0_ONLINE ]] && return
+	grep -q -w nohz_full /proc/cmdline
+	[[ $? -ne 0 ]] && return
+
+	run_state_test CPU0_ISOLCPUS_MATRIX
+}
+
 #
 # Testing the new "isolated" partition root type
 #
@@ -1207,6 +1266,7 @@ test_inotify()
 trap cleanup 0 2 3 6
 run_state_test TEST_MATRIX
 run_remote_state_test REMOTE_TEST_MATRIX
+run_cpu0_isol_test
 test_isolated
 test_inotify
 echo "All tests PASSED."
-- 
2.53.0



* Re: [PATCH 03/23] tick/nohz: Make nohz_full parameter optional
  2026-04-21  3:03 ` [PATCH 03/23] tick/nohz: Make nohz_full parameter optional Waiman Long
@ 2026-04-21  8:32   ` Thomas Gleixner
  2026-04-21 14:14     ` Waiman Long
  2026-04-22  3:08   ` sashiko-bot
  1 sibling, 1 reply; 56+ messages in thread
From: Thomas Gleixner @ 2026-04-21  8:32 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
> To provide nohz_full tick support, there is a set of tick dependency
> masks that need to be evaluated on every IRQ and context switch.

s/IRQ/interrupt/

This is a changelog and not a SMS service.

> Switching on nohz_full tick support at runtime will be problematic
> as some of the tick dependency masks may not be properly set causing
> problem down the road.

That's useless blurb with zero content.

> Allow nohz_full boot option to be specified without any
> parameter to force enable nohz_full tick support without any
> CPU in the tick_nohz_full_mask yet. The context_tracking_key and
> tick_nohz_full_running flag will be enabled in this case to make
> tick_nohz_full_enabled() return true.

I kinda can crystal-ball what you are trying to say here, but that does
not make it qualified as a proper change log.

> There is still a small performance overhead by force enable nohz_full
> this way. So it should only be used if there is a chance that some
> CPUs may become isolated later via the cpuset isolated partition
> functionality and better CPU isolation closed to nohz_full is desired.

Why has this key to be enabled on boot if there are no CPUs in the
isolated mask?

If you want to manage this dynamically at runtime then enable the key
once CPUs are isolated. Yes, it's more work, but that avoids the "should
only be used" nonsense and makes this more robust down the road.

Thanks,

        tglx




* Re: [PATCH 04/23] tick/nohz: Allow runtime changes in full dynticks CPUs
  2026-04-21  3:03 ` [PATCH 04/23] tick/nohz: Allow runtime changes in full dynticks CPUs Waiman Long
@ 2026-04-21  8:50   ` Thomas Gleixner
  2026-04-21 14:24     ` Waiman Long
  2026-04-22  3:08   ` sashiko-bot
  1 sibling, 1 reply; 56+ messages in thread
From: Thomas Gleixner @ 2026-04-21  8:50 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
> Full dynticks can only be enabled if "nohz_full" boot option has been
> been specified with or without parameter. Any change in the list of
> nohz_full CPUs have to be reflected in tick_nohz_full_mask. Introduce
> a new tick_nohz_full_update_cpus() helper that can be called to update
> the tick_nohz_full_mask at run time. The housekeeping_update() function
> is modified to call the new helper when the HK_TYPE_KERNEL_NOSIE cpumask
> is going to be changed.
>
> We also need to enable CPU context tracking for those CPUs that

We need nothing. Use passive voice for change logs as requested in
documentation.

> are in tick_nohz_full_mask. So remove __init from tick_nohz_init()
> and ct_cpu_track_user() so that they can be called later when an isolated
> cpuset partition is being created. The __ro_after_init attribute is
> taken away from context_tracking_key as well.
>
> Also add a new ct_cpu_untrack_user() function to reverse the action of
> ct_cpu_track_user() in case we need to disable the nohz_full mode of
> a CPU.
>
> With nohz_full enabled, the boot CPU (typically CPU 0) will be the
> tick CPU which cannot be shut down easily. So the boot CPU should not
> be used in an isolated cpuset partition.
>
> With runtime modification of nohz_full CPUs, tick_do_timer_cpu can become
> TICK_DO_TIMER_NONE. So remove the two TICK_DO_TIMER_NONE WARN_ON_ONCE()
> checks in tick-sched.c to avoid unnecessary warnings.

in tick-sched.c? Describe the functions which contain that.

>  static inline void tick_nohz_task_switch(void)
> diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
> index 925999de1a28..394e432630a3 100644
> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -411,7 +411,7 @@ static __always_inline void ct_kernel_enter(bool user, int offset) { }
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/context_tracking.h>
>  
> -DEFINE_STATIC_KEY_FALSE_RO(context_tracking_key);
> +DEFINE_STATIC_KEY_FALSE(context_tracking_key);
>  EXPORT_SYMBOL_GPL(context_tracking_key);
>  
>  static noinstr bool context_tracking_recursion_enter(void)
> @@ -674,9 +674,9 @@ void user_exit_callable(void)
>  }
>  NOKPROBE_SYMBOL(user_exit_callable);
>  
> -void __init ct_cpu_track_user(int cpu)
> +void ct_cpu_track_user(int cpu)
>  {
> -	static __initdata bool initialized = false;
> +	static bool initialized;
>  
>  	if (cpu == CONTEXT_TRACKING_FORCE_ENABLE) {
>  		static_branch_inc(&context_tracking_key);
> @@ -700,6 +700,15 @@ void __init ct_cpu_track_user(int cpu)
>  	initialized = true;
>  }
>  
> +void ct_cpu_untrack_user(int cpu)
> +{
> +	if (!per_cpu(context_tracking.active, cpu))
> +		return;
> +
> +	per_cpu(context_tracking.active, cpu) = false;
> +	static_branch_dec(&context_tracking_key);
> +}
> +

Why is this in a patch which makes tick/nohz related changes? This is a
preparatory change, so make it that way and do not bury it inside
something else.

> +/* Get the new set of run-time nohz CPU list & update accordingly */
> +void tick_nohz_full_update_cpus(struct cpumask *cpumask)
> +{
> +	int cpu;
> +
> +	if (!tick_nohz_full_running) {
> +		pr_warn_once("Full dynticks cannot be enabled without the nohz_full kernel boot parameter!\n");

That's the result of this enforced enable hackery. Make this work
properly.

> +		return;
> +	}
> +
> +	/*
> +	 * To properly enable/disable nohz_full dynticks for the affected CPUs,
> +	 * the new nohz_full CPUs have to be copied to tick_nohz_full_mask and
> +	 * ct_cpu_track_user/ct_cpu_untrack_user() will have to be called
> +	 * for those CPUs that have their states changed. Those CPUs should be
> +	 * in an offline state.
> +	 */
> +	for_each_cpu_andnot(cpu, cpumask, tick_nohz_full_mask) {
> +		WARN_ON_ONCE(cpu_online(cpu));
> +		ct_cpu_track_user(cpu);
> +		cpumask_set_cpu(cpu, tick_nohz_full_mask);
> +	}
> +
> +	for_each_cpu_andnot(cpu, tick_nohz_full_mask, cpumask) {
> +		WARN_ON_ONCE(cpu_online(cpu));
> +		ct_cpu_untrack_user(cpu);
> +		cpumask_clear_cpu(cpu, tick_nohz_full_mask);
> +	}
> +}

So this writes to tick_nohz_full_mask while other CPUs can access
it. That's just wrong and I'm not at all interested in the resulting
KCSAN warnings.

tick_nohz_full_mask needs to become a RCU protected pointer, which is
updated once the new mask is established in a separately allocated one.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 05/23] tick: Pass timer tick job to an online HK CPU in tick_cpu_dying()
  2026-04-21  3:03 ` [PATCH 05/23] tick: Pass timer tick job to an online HK CPU in tick_cpu_dying() Waiman Long
@ 2026-04-21  8:55   ` Thomas Gleixner
  2026-04-21 14:22     ` Waiman Long
  0 siblings, 1 reply; 56+ messages in thread
From: Thomas Gleixner @ 2026-04-21  8:55 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
> In tick_cpu_dying(), if the dying CPU is the current timekeeper,
> it has to pass the job over to another CPU. The current code passes
> it to another online CPU. However, that CPU may not be a timer tick
> housekeeping CPU.  If that happens, another CPU will have to manually
> take it over again later. Avoid this unnecessary work by directly
> assigning an online housekeeping CPU.
>
> Use READ_ONCE/WRITE_ONCE() to access tick_do_timer_cpu in case the
> non-HK CPUs may not be in stop machine in the future.

'may not be in the future' is yet more handwaving without
content. Please write your change logs in a way so that people who have
not spent months on this can follow.

> @@ -394,12 +395,19 @@ int tick_cpu_dying(unsigned int dying_cpu)
>  {
>  	/*
>  	 * If the current CPU is the timekeeper, it's the only one that can
> -	 * safely hand over its duty. Also all online CPUs are in stop
> -	 * machine, guaranteed not to be idle, therefore there is no
> +	 * safely hand over its duty. Also all online housekeeping CPUs are
> +	 * in stop machine, guaranteed not to be idle, therefore there is no
>  	 * concurrency and it's safe to pick any online successor.
>  	 */
> -	if (tick_do_timer_cpu == dying_cpu)
> -		tick_do_timer_cpu = cpumask_first(cpu_online_mask);
> +	if (READ_ONCE(tick_do_timer_cpu) == dying_cpu) {
> +		unsigned int new_cpu;
> +
> +		guard(rcu)();

What's this guard for?

> +		new_cpu = cpumask_first_and(cpu_online_mask, housekeeping_cpumask(HK_TYPE_TICK));

Why has this to use housekeeping_cpumask() and does not use
tick_nohz_full_mask?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 10/23] cpu: Use RCU to protect access of HK_TYPE_TIMER cpumask
  2026-04-21  3:03 ` [PATCH 10/23] cpu: " Waiman Long
@ 2026-04-21  8:57   ` Thomas Gleixner
  2026-04-21 14:25     ` Waiman Long
  0 siblings, 1 reply; 56+ messages in thread
From: Thomas Gleixner @ 2026-04-21  8:57 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
> As HK_TYPE_TIMER cpumask is going to be changeable at run time, use
> RCU to protect access to the cpumask.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/cpu.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index bc4f7a9ba64e..0d02b5d7a7ba 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1890,6 +1890,8 @@ int freeze_secondary_cpus(int primary)
>  	cpu_maps_update_begin();
>  	if (primary == -1) {
>  		primary = cpumask_first(cpu_online_mask);
> +
> +		guard(rcu)();
>  		if (!housekeeping_cpu(primary, HK_TYPE_TIMER))
>  			primary = housekeeping_any_cpu(HK_TYPE_TIMER);

housekeeping_cpu() and housekeeping_any_cpu() can operate on two
different CPU masks once the runtime update is enabled.

Seriously?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 11/23] hrtimer: Use RCU to protect access of HK_TYPE_TIMER cpumask
  2026-04-21  3:03 ` [PATCH 11/23] hrtimer: " Waiman Long
@ 2026-04-21  8:59   ` Thomas Gleixner
  2026-04-22  3:09   ` sashiko-bot
  1 sibling, 0 replies; 56+ messages in thread
From: Thomas Gleixner @ 2026-04-21  8:59 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
> As HK_TYPE_TIMER cpumask is going to be changeable at run time, use

As the ...

> RCU to protect access to the cpumask.
>
> The access of HK_TYPE_TIMER cpumask within hrtimers_cpu_dying() is
> protected as interrupt is disabled and all the other CPUs are stopped

interrupts are disabled

> when this function is invoked as part of the CPU tear down process.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 16/23] genirq/cpuhotplug: Use RCU to protect access of HK_TYPE_MANAGED_IRQ cpumask
  2026-04-21  3:03 ` [PATCH 16/23] genirq/cpuhotplug: " Waiman Long
@ 2026-04-21  9:02   ` Thomas Gleixner
  2026-04-21 14:29     ` Waiman Long
  0 siblings, 1 reply; 56+ messages in thread
From: Thomas Gleixner @ 2026-04-21  9:02 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:

> As HK_TYPE_MANAGED_IRQ cpumask is going to be changeable at run time,
> use RCU to protect access to the cpumask.
>
> To enable the new HK_TYPE_MANAGED_IRQ cpumask to take effect, the
> following steps can be done.

Can be done?

>  1) Update the HK_TYPE_MANAGED_IRQ cpumask to take out the newly isolated
>     CPUs and add back the de-isolated CPUs.
>  2) Tear down the affected CPUs to cause irq_migrate_all_off_this_cpu()
>     to be called on the affected CPUs to migrate the irqs to other
>     HK_TYPE_MANAGED_IRQ housekeeping CPUs.
>  3) Bring up the previously offline CPUs to invoke
>     irq_affinity_online_cpu() to allow the newly de-isolated CPUs to
>     be used for managed irqs.

Which previously offline CPUs?

> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 2e8072437826..8270c4de260b 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -263,6 +263,7 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask, bool
>  	    housekeeping_enabled(HK_TYPE_MANAGED_IRQ)) {
>  		const struct cpumask *hk_mask;
>  
> +		guard(rcu)();
>  		hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
>  
>  		cpumask_and(tmp_mask, mask, hk_mask);

How is this hunk related to $Subject?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 03/23] tick/nohz: Make nohz_full parameter optional
  2026-04-21  8:32   ` Thomas Gleixner
@ 2026-04-21 14:14     ` Waiman Long
  0 siblings, 0 replies; 56+ messages in thread
From: Waiman Long @ 2026-04-21 14:14 UTC (permalink / raw)
  To: Thomas Gleixner, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan

On 4/21/26 4:32 AM, Thomas Gleixner wrote:
> On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
>> To provide nohz_full tick support, there is a set of tick dependency
>> masks that need to be evaluated on every IRQ and context switch.
> s/IRQ/interrupt/
>
> This is a changelog and not a SMS service.
>> Switching on nohz_full tick support at runtime will be problematic
>> as some of the tick dependency masks may not be properly set, causing
>> problems down the road.
> That's useless blurb with zero content.
>
>> Allow nohz_full boot option to be specified without any
>> parameter to force enable nohz_full tick support without any
>> CPU in the tick_nohz_full_mask yet. The context_tracking_key and
>> tick_nohz_full_running flag will be enabled in this case to make
>> tick_nohz_full_enabled() return true.
> I kinda can crystal-ball what you are trying to say here, but that does
> not make it qualified as a proper change log.
>
>> There is still a small performance overhead from force-enabling
>> nohz_full this way. So it should only be used if there is a chance
>> that some CPUs may become isolated later via the cpuset isolated
>> partition functionality and better CPU isolation close to nohz_full
>> is desired.
> Why has this key to be enabled on boot if there are no CPUs in the
> isolated mask?
>
> If you want to manage this dynamically at runtime then enable the key
> once CPUs are isolated. Yes, it's more work, but that avoids the "should
> only be used" nonsense and makes this more robust down the road.

OK, I will try to make it fully dynamic. Of course, it will be more work.

Cheers,
Longman

> Thanks,
>
>          tglx
>
>


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 05/23] tick: Pass timer tick job to an online HK CPU in tick_cpu_dying()
  2026-04-21  8:55   ` Thomas Gleixner
@ 2026-04-21 14:22     ` Waiman Long
  0 siblings, 0 replies; 56+ messages in thread
From: Waiman Long @ 2026-04-21 14:22 UTC (permalink / raw)
  To: Thomas Gleixner, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan

On 4/21/26 4:55 AM, Thomas Gleixner wrote:
> On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
>> In tick_cpu_dying(), if the dying CPU is the current timekeeper,
>> it has to pass the job over to another CPU. The current code passes
>> it to another online CPU. However, that CPU may not be a timer tick
>> housekeeping CPU.  If that happens, another CPU will have to manually
>> take it over again later. Avoid this unnecessary work by directly
>> assigning an online housekeeping CPU.
>>
>> Use READ_ONCE/WRITE_ONCE() to access tick_do_timer_cpu in case the
>> non-HK CPUs may not be in stop machine in the future.
> 'may not be in the future' is yet more handwaving without
> content. Please write your change logs in a way so that people who have
> not spent months on this can follow.
>
>> @@ -394,12 +395,19 @@ int tick_cpu_dying(unsigned int dying_cpu)
>>   {
>>   	/*
>>   	 * If the current CPU is the timekeeper, it's the only one that can
>> -	 * safely hand over its duty. Also all online CPUs are in stop
>> -	 * machine, guaranteed not to be idle, therefore there is no
>> +	 * safely hand over its duty. Also all online housekeeping CPUs are
>> +	 * in stop machine, guaranteed not to be idle, therefore there is no
>>   	 * concurrency and it's safe to pick any online successor.
>>   	 */
>> -	if (tick_do_timer_cpu == dying_cpu)
>> -		tick_do_timer_cpu = cpumask_first(cpu_online_mask);
>> +	if (READ_ONCE(tick_do_timer_cpu) == dying_cpu) {
>> +		unsigned int new_cpu;
>> +
>> +		guard(rcu)();
> What's this guard for?
>
>> +		new_cpu = cpumask_first_and(cpu_online_mask, housekeeping_cpumask(HK_TYPE_TICK));
> Why has this to use housekeeping_cpumask() and does not use
> tick_nohz_full_mask?

The RCU guard is for accessing the HK_TYPE_TICK (HK_TYPE_KERNEL_NOISE)
cpumask. The tick_nohz_full_mask cpumask is actually the inverse of the
HK_TYPE_TICK cpumask. Yes, I could use cpumask_first_andnot() with
tick_nohz_full_mask. If we make tick_nohz_full_mask an RCU-protected
pointer, we still need the guard.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 04/23] tick/nohz: Allow runtime changes in full dynticks CPUs
  2026-04-21  8:50   ` Thomas Gleixner
@ 2026-04-21 14:24     ` Waiman Long
  0 siblings, 0 replies; 56+ messages in thread
From: Waiman Long @ 2026-04-21 14:24 UTC (permalink / raw)
  To: Thomas Gleixner, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan

On 4/21/26 4:50 AM, Thomas Gleixner wrote:
> On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
>> Full dynticks can only be enabled if the "nohz_full" boot option has
>> been specified, with or without a parameter. Any change in the list of
>> nohz_full CPUs has to be reflected in tick_nohz_full_mask. Introduce
>> a new tick_nohz_full_update_cpus() helper that can be called to update
>> the tick_nohz_full_mask at run time. The housekeeping_update() function
>> is modified to call the new helper when the HK_TYPE_KERNEL_NOISE cpumask
>> is going to be changed.
>>
>> We also need to enable CPU context tracking for those CPUs that
> We need nothing. Use passive voice for change logs as requested in
> documentation.
>
>> are in tick_nohz_full_mask. So remove __init from tick_nohz_init()
>> and ct_cpu_track_user() so that they can be called later when an isolated
>> cpuset partition is being created. The __ro_after_init attribute is
>> taken away from context_tracking_key as well.
>>
>> Also add a new ct_cpu_untrack_user() function to reverse the action of
>> ct_cpu_track_user() in case we need to disable the nohz_full mode of
>> a CPU.
>>
>> With nohz_full enabled, the boot CPU (typically CPU 0) will be the
>> tick CPU which cannot be shut down easily. So the boot CPU should not
>> be used in an isolated cpuset partition.
>>
>> With runtime modification of nohz_full CPUs, tick_do_timer_cpu can become
>> TICK_DO_TIMER_NONE. So remove the two TICK_DO_TIMER_NONE WARN_ON_ONCE()
>> checks in tick-sched.c to avoid unnecessary warnings.
> in tick-sched.c? Describe the functions which contain that.
>
>>   static inline void tick_nohz_task_switch(void)
>> diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
>> index 925999de1a28..394e432630a3 100644
>> --- a/kernel/context_tracking.c
>> +++ b/kernel/context_tracking.c
>> @@ -411,7 +411,7 @@ static __always_inline void ct_kernel_enter(bool user, int offset) { }
>>   #define CREATE_TRACE_POINTS
>>   #include <trace/events/context_tracking.h>
>>   
>> -DEFINE_STATIC_KEY_FALSE_RO(context_tracking_key);
>> +DEFINE_STATIC_KEY_FALSE(context_tracking_key);
>>   EXPORT_SYMBOL_GPL(context_tracking_key);
>>   
>>   static noinstr bool context_tracking_recursion_enter(void)
>> @@ -674,9 +674,9 @@ void user_exit_callable(void)
>>   }
>>   NOKPROBE_SYMBOL(user_exit_callable);
>>   
>> -void __init ct_cpu_track_user(int cpu)
>> +void ct_cpu_track_user(int cpu)
>>   {
>> -	static __initdata bool initialized = false;
>> +	static bool initialized;
>>   
>>   	if (cpu == CONTEXT_TRACKING_FORCE_ENABLE) {
>>   		static_branch_inc(&context_tracking_key);
>> @@ -700,6 +700,15 @@ void __init ct_cpu_track_user(int cpu)
>>   	initialized = true;
>>   }
>>   
>> +void ct_cpu_untrack_user(int cpu)
>> +{
>> +	if (!per_cpu(context_tracking.active, cpu))
>> +		return;
>> +
>> +	per_cpu(context_tracking.active, cpu) = false;
>> +	static_branch_dec(&context_tracking_key);
>> +}
>> +
> Why is this in a patch which makes tick/nohz related changes? This is a
> preparatory change, so make it that way and do not bury it inside
> something else.
Sure, I will break out the context tracking change into its own patch.
>> +/* Get the new set of run-time nohz CPU list & update accordingly */
>> +void tick_nohz_full_update_cpus(struct cpumask *cpumask)
>> +{
>> +	int cpu;
>> +
>> +	if (!tick_nohz_full_running) {
>> +		pr_warn_once("Full dynticks cannot be enabled without the nohz_full kernel boot parameter!\n");
> That's the result of this enforced enable hackery. Make this work
> properly.
>
>> +		return;
>> +	}
>> +
>> +	/*
>> +	 * To properly enable/disable nohz_full dynticks for the affected CPUs,
>> +	 * the new nohz_full CPUs have to be copied to tick_nohz_full_mask and
>> +	 * ct_cpu_track_user/ct_cpu_untrack_user() will have to be called
>> +	 * for those CPUs that have their states changed. Those CPUs should be
>> +	 * in an offline state.
>> +	 */
>> +	for_each_cpu_andnot(cpu, cpumask, tick_nohz_full_mask) {
>> +		WARN_ON_ONCE(cpu_online(cpu));
>> +		ct_cpu_track_user(cpu);
>> +		cpumask_set_cpu(cpu, tick_nohz_full_mask);
>> +	}
>> +
>> +	for_each_cpu_andnot(cpu, tick_nohz_full_mask, cpumask) {
>> +		WARN_ON_ONCE(cpu_online(cpu));
>> +		ct_cpu_untrack_user(cpu);
>> +		cpumask_clear_cpu(cpu, tick_nohz_full_mask);
>> +	}
>> +}
> So this writes to tick_nohz_full_mask while other CPUs can access
> it. That's just wrong and I'm not at all interested in the resulting
> KCSAN warnings.
>
> tick_nohz_full_mask needs to become a RCU protected pointer, which is
> updated once the new mask is established in a separately allocated one.

I will work on that.

Cheers,
Longman

>
> Thanks,
>
>          tglx
>
>


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 10/23] cpu: Use RCU to protect access of HK_TYPE_TIMER cpumask
  2026-04-21  8:57   ` Thomas Gleixner
@ 2026-04-21 14:25     ` Waiman Long
  0 siblings, 0 replies; 56+ messages in thread
From: Waiman Long @ 2026-04-21 14:25 UTC (permalink / raw)
  To: Thomas Gleixner, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan

On 4/21/26 4:57 AM, Thomas Gleixner wrote:
> On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
>> As HK_TYPE_TIMER cpumask is going to be changeable at run time, use
>> RCU to protect access to the cpumask.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   kernel/cpu.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index bc4f7a9ba64e..0d02b5d7a7ba 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -1890,6 +1890,8 @@ int freeze_secondary_cpus(int primary)
>>   	cpu_maps_update_begin();
>>   	if (primary == -1) {
>>   		primary = cpumask_first(cpu_online_mask);
>> +
>> +		guard(rcu)();
>>   		if (!housekeeping_cpu(primary, HK_TYPE_TIMER))
>>   			primary = housekeeping_any_cpu(HK_TYPE_TIMER);
> housekeeping_cpu() and housekeeping_any_cpu() can operate on two
> different CPU masks once the runtime update is enabled.
>
> Seriously?

Good point, will fix that in the next version.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 16/23] genirq/cpuhotplug: Use RCU to protect access of HK_TYPE_MANAGED_IRQ cpumask
  2026-04-21  9:02   ` Thomas Gleixner
@ 2026-04-21 14:29     ` Waiman Long
  0 siblings, 0 replies; 56+ messages in thread
From: Waiman Long @ 2026-04-21 14:29 UTC (permalink / raw)
  To: Thomas Gleixner, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan

On 4/21/26 5:02 AM, Thomas Gleixner wrote:
> On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
>
>> As HK_TYPE_MANAGED_IRQ cpumask is going to be changeable at run time,
>> use RCU to protect access to the cpumask.
>>
>> To enable the new HK_TYPE_MANAGED_IRQ cpumask to take effect, the
>> following steps can be done.
> Can be done?
>
>>   1) Update the HK_TYPE_MANAGED_IRQ cpumask to take out the newly isolated
>>      CPUs and add back the de-isolated CPUs.
>>   2) Tear down the affected CPUs to cause irq_migrate_all_off_this_cpu()
>>      to be called on the affected CPUs to migrate the irqs to other
>>      HK_TYPE_MANAGED_IRQ housekeeping CPUs.
>>   3) Bring up the previously offline CPUs to invoke
>>      irq_affinity_online_cpu() to allow the newly de-isolated CPUs to
>>      be used for managed irqs.
> Which previously offline CPUs?
This part should go into another patch.
>
>> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
>> index 2e8072437826..8270c4de260b 100644
>> --- a/kernel/irq/manage.c
>> +++ b/kernel/irq/manage.c
>> @@ -263,6 +263,7 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask, bool
>>   	    housekeeping_enabled(HK_TYPE_MANAGED_IRQ)) {
>>   		const struct cpumask *hk_mask;
>>   
>> +		guard(rcu)();
>>   		hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
>>   
>>   		cpumask_and(tmp_mask, mask, hk_mask);
> How is this hunk related to $Subject?

The subject is actually about using RCU to protect access to the
housekeeping cpumask. There is extra info in the commit log that
should go into another patch.

Cheers,
Longman

>


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 18/23] cpu/hotplug: Add a new cpuhp_offline_cb() API
  2026-04-21  3:03 ` [PATCH 18/23] cpu/hotplug: Add a new cpuhp_offline_cb() API Waiman Long
@ 2026-04-21 16:17   ` Thomas Gleixner
  2026-04-21 17:29     ` Waiman Long
  2026-04-22  3:09   ` sashiko-bot
  1 sibling, 1 reply; 56+ messages in thread
From: Thomas Gleixner @ 2026-04-21 16:17 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long

On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
> Add a new cpuhp_offline_cb() API that allows us to offline a set of
> CPUs one-by-one, run the given callback function and then bring those
> CPUs back online again while inhibiting any concurrent CPU hotplug
> operations from happening.

Please provide a properly structured change log which explains the
context, the problem and the solution in separate paragraphs and this
order. This is not new. It's documented...

> This new API can be used to enable runtime adjustment of nohz_full and
> isolcpus boot command line options. A new cpuhp_offline_cb_mode flag
> is also added to signal that the system is in this offline callback
> transient state so that some hotplug operations can be optimized out
> if we choose to.

We chose nothing.

> +#include <linux/cpumask_types.h>

What for? This header only needs a 'struct cpumask' forward declaration
so that the compiler can handle the pointer argument, no?

> +typedef int (*cpuhp_cb_t)(void *arg);

You couldn't come up with a more generic name for this, right?

>  struct device;
>  
>  extern int lockdep_is_cpus_held(void);
> @@ -29,6 +31,8 @@ void clear_tasks_mm_cpumask(int cpu);
>  int remove_cpu(unsigned int cpu);
>  int cpu_device_down(struct device *dev);
>  void smp_shutdown_nonboot_cpus(unsigned int primary_cpu);
> +int cpuhp_offline_cb(struct cpumask *mask, cpuhp_cb_t func, void *arg);

Ditto.

> +extern bool cpuhp_offline_cb_mode;

Groan. The only users are in the cpusets code which invokes this muck
and should therefore know what's going on, no?

>  #else /* CONFIG_HOTPLUG_CPU */
>  
> @@ -43,6 +47,11 @@ static inline void cpu_hotplug_disable(void) { }
>  static inline void cpu_hotplug_enable(void) { }
>  static inline int remove_cpu(unsigned int cpu) { return -EPERM; }
>  static inline void smp_shutdown_nonboot_cpus(unsigned int primary_cpu) { }
> +static inline int cpuhp_offline_cb(struct cpumask *mask, cpuhp_cb_t func, void *arg)
> +{
> +	return -EPERM;

-EPERM?

> +/**
> + * cpuhp_offline_cb - offline CPUs, invoke callback function & online CPUs afterward
> + * @mask: A mask of CPUs to be taken offline and then online
> + * @func: A callback function to be invoked while the given CPUs are offline
> + * @arg:  Argument to be passed back to the callback function
> + *
> + * Return: 0 if successful, an error code otherwise
> + */
> +int cpuhp_offline_cb(struct cpumask *mask, cpuhp_cb_t func, void *arg)
> +{
> +	int off_cpu, on_cpu, ret, ret2 = 0;
> +
> +	if (WARN_ON_ONCE(cpumask_empty(mask) ||
> +	   !cpumask_subset(mask, cpu_online_mask)))
> +		return -EINVAL;

No line break required. You have 100 characters.

But what's worse is that the access to cpu_online_mask is not protected
against a concurrent CPU hotplug operation.

> +
> +	pr_debug("%s: begin (CPU list = %*pbl)\n", __func__, cpumask_pr_args(mask));

Tracing?

> +	lock_device_hotplug();
> +	cpuhp_offline_cb_mode = true;
> +	/*
> +	 * If all offline operations succeed, off_cpu should become nr_cpu_ids.
> +	 */
> +	for_each_cpu(off_cpu, mask) {
> +		ret = device_offline(get_cpu_device(off_cpu));
> +		if (unlikely(ret))
> +			break;
> +	}
> +	if (!ret)
> +		ret = func(arg);
> +
> +	/* Bring previously offline CPUs back online */
> +	for_each_cpu(on_cpu, mask) {
> +		int retries = 0;
> +
> +		if (on_cpu == off_cpu)
> +			break;
> +
> +retry:
> +		ret2 = device_online(get_cpu_device(on_cpu));
> +
> +		/*
> +		 * With the unlikely event that CPU hotplug is disabled while
> +		 * this operation is in progress, we will need to wait a bit
> +		 * for hotplug to hopefully be re-enabled again. If not, print
> +		 * a warning and return the error.
> +		 *
> +		 * cpu_hotplug_disabled is supposed to be accessed while
> +		 * holding the cpu_add_remove_lock mutex. So we need to
> +		 * use the data_race() macro to access it here.
> +		 */
> +		while ((ret2 == -EBUSY) && data_race(cpu_hotplug_disabled) &&
> +		       (++retries <= 5)) {
> +			msleep(20);
> +			if (!data_race(cpu_hotplug_disabled))
> +				goto retry;
> +		}
> +		if (ret2) {
> +			pr_warn("%s: Failed to bring CPU %d back online!\n",
> +				__func__, on_cpu);

Provide a proper text and not this silly __func__ thing.

> +			break;
> +		}
> +	}

TBH. This is unreviewable gunk and the whole 'unlikely event that CPU
hotplug is disabled' is just a lazy hack.

All of this can be avoided including this made up callback function.

It's not rocket science to provide:

     1) A function which serializes against any other CPU hotplug
        related action.

     2) A function which brings the CPUs in a given CPU mask down

     3) A function which brings the CPUs in a given CPU mask up

     4) A function which undoes #1

Yeah I know, it's more work and not convoluted enough. But see below.
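For illustration only, the four steps could be modeled as a self-contained
userspace sketch. All names here — cpu_masks_lock() and friends, the toy
'online' bitmask — are invented for the model and are not an existing
kernel API:

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Toy model: bit i of 'online' set == CPU i is online. CPUs 0-7 start online. */
static uint64_t online = 0xFFu;
static pthread_mutex_t hotplug_lock = PTHREAD_MUTEX_INITIALIZER;

/* #1: serialize against any other CPU hotplug related action */
static void cpu_masks_lock(void)   { pthread_mutex_lock(&hotplug_lock); }
/* #4: undo #1 */
static void cpu_masks_unlock(void) { pthread_mutex_unlock(&hotplug_lock); }

/* #2: bring the CPUs in 'mask' down, one by one */
static int cpu_masks_bring_down(uint64_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (mask & (1ull << cpu))
			online &= ~(1ull << cpu);	/* device_offline() stand-in */
	return 0;
}

/* #3: bring the CPUs in 'mask' up, one by one */
static int cpu_masks_bring_up(uint64_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (mask & (1ull << cpu))
			online |= 1ull << cpu;		/* device_online() stand-in */
	return 0;
}
```

The caller then composes the sequence itself — lock, down, do the mask
updates, up, unlock — instead of passing a callback into the hotplug core.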

That brings me to that other hack namely cpuhp_offline_cb_mode, which
you self described as such in patch 21/23:

> +	/*
> +	 * Hack: In cpuhp_offline_cb_mode, pretend all partitions are empty
> +	 * to prevent unnecessary partition invalidation.
> +	 */
> +	if (cpuhp_offline_cb_mode)
> +		return false;
> +

We are not merging hacks. End of story. But you knew that already, no?

Let's take a step back and see what you really need to achieve:

  1) Update tick_nohz_full_mask
  2) Update the managed interrupt mask
  3) Update CPU sets

Independent of the direction of this update you need to ensure that the
affected functionality keeps working correctly.

You achieve that by bulk offlining the affected CPUs, invoking a magic
callback and then bulk onlining the affected CPUs again, which requires
that ill defined cpuhp_offline_cb_mode hackery and probably some more
hacks all over the place.

You can achieve the same by doing CPU by CPU operations in the right
order without this mode hack, when you establish proper limitations for
this:

  At no point in time is it allowed to empty a CPU set or an affected CPU
  mask, except when you completely undo the isolation of CPUs.

  That can be computed upfront w/o changing anything at all. Once the
  validity is established, the update can proceed. Or you can leave it
  to user space which can keep the pieces if it gets it wrong.

That's a reasonable limitation as there is absolutely zero justification
to support something like:

       housekeeping_cpus = [CPU 0], isolated_cpus = [CPU 1]
  ---> housekeeping_cpus = [CPU 1], isolated_cpus = [CPU 0]

just because we can with enough horrible hacks.

If you get that out of the way, then a CPU by CPU update becomes the
obvious and simplest solution. The ordering constraints can be computed
in user space upfront and there is no reason to do any of this in the
kernel itself except for an eventual validation step. It might be a tad
slower, but this is anything but a hotpath operation.
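The "validate upfront" constraint above can be checked before touching
anything. A sketch (isolation_update_valid() is a hypothetical helper on
toy 64-bit masks, not kernel code) that simulates the per-CPU transition
and rejects any update whose intermediate or final state empties the
housekeeping mask:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simulate moving CPUs one by one from the old housekeeping mask to the
 * new one. CPUs leaving housekeeping are dropped first, one at a time;
 * if any intermediate state is empty, the update is invalid.
 */
static int isolation_update_valid(uint64_t old_hk, uint64_t new_hk)
{
	uint64_t cur = old_hk;
	uint64_t leaving = old_hk & ~new_hk;

	for (int cpu = 0; cpu < 64; cpu++) {
		if (!(leaving & (1ull << cpu)))
			continue;
		cur &= ~(1ull << cpu);
		if (!cur)
			return 0;	/* housekeeping would go empty */
	}
	/* CPUs gaining housekeeping duty are added afterwards. */
	cur |= new_hk & ~old_hk;
	return cur != 0;
}
```

With this ordering, the [CPU 0] <-> [CPU 1] swap from the example above is
rejected, because CPU 0 would have to leave housekeeping before CPU 1 can
join it.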

Just for the record. I suggested exactly this more than a year ago and
it's still the right thing to do.

And of course neither your cover letter nor any of the patches give a
proper rationale why you think that your bulk hackery is better. For the
very simple reason that there is no rationale at all.

This bulk muck is doomed when your ultimate goal is to avoid the stop
machine dance. With a per CPU update it is actually doable without more
ill defined hacks all over the place.

   1) Bring down the CPU to CPUHP_AP_SCHED_WAIT_EMPTY, which is the last
      state before stop machine is invoked.

      At that point:

         - no user space thread is running on the CPU anymore

         - everything related to this CPU has been shut down or moved
           elsewhere

         - interrupt managed device queues are quiesced if the CPU was
           the last online one in the queue affinity mask. If not the
           interrupt might still be affine to the CPU, but there is at
           least one other CPU available in the mask.

   2) Update the tick NOHZ handover

      This can be done without going into stop machine by providing a
      hotplug callback right between CPUHP_AP_SMPBOOT_THREADS and
      CPUHP_AP_IRQ_AFFINITY_ONLINE.

      That's trivial enough to achieve and can work independently of
      NOHZ full.

   3) Rework the affinity management, so that interrupt affinities can
      be reassigned in the CPUHP_AP_IRQ_AFFINITY_ONLINE state.

      That needs a lot of thought, but there is no real reason why it
      can't work.

   4) Flip the housekeeping CPU masks in sched_cpu_wait_empty() after
      balance_hotplug_wait().

   5) Bring the CPU online again.

For #2 and #3 to work you need a separate CPU mask which avoids touching
CPU online mask. For #3 this needs some more work to avoid reassigning the
interrupts once sparse_irq_lock is dropped, but the bulk is achieved
with the separate CPU mask.

No?

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 18/23] cpu/hotplug: Add a new cpuhp_offline_cb() API
  2026-04-21 16:17   ` Thomas Gleixner
@ 2026-04-21 17:29     ` Waiman Long
  2026-04-21 18:43       ` Thomas Gleixner
  0 siblings, 1 reply; 56+ messages in thread
From: Waiman Long @ 2026-04-21 17:29 UTC (permalink / raw)
  To: Thomas Gleixner, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan

On 4/21/26 12:17 PM, Thomas Gleixner wrote:
> [ ... ]
>     1) Bring down the CPU to CPUHP_AP_SCHED_WAIT_EMPTY, which is the last
>        state before stop machine is invoked.
>
>        At that point:
>
>           - no user space thread is running on the CPU anymore
>
>           - everything related to this CPU has been shut down or moved
>             elsewhere
>
>           - interrupt managed device queues are quiesced if the CPU was
>             the last online one in the queue affinity mask. If not the
>             interrupt might still be affine to the CPU, but there is at
>             least one other CPU available in the mask.
>
>     2) Update the tick NOHZ handover
>
>        This can be done without going into stop machine by providing a
>        hotplug callback right between CPUHP_AP_SMPBOOT_THREADS and
>        CPUHP_AP_IRQ_AFFINITY_ONLINE.
>
>        That's trivial enough to achieve and can work independently of
>        NOHZ full.
>
>     3) Rework the affinity management, so that interrupt affinities can
>        be reassigned in the CPUHP_AP_IRQ_AFFINITY_ONLINE state.
>
>        That needs a lot of thoughts, but there is no real reason why it
>        can't work.
>
>     4) Flip the housekeeping CPU masks in sched_cpu_wait_empty() after
>        balance_hotplug_wait().
>
>     5) Bring the CPU online again.
>
> For #2 and #3 to work you need a separate CPU mask which avoids touching
> CPU online mask. For #3 this needs some more work to avoid reassigning the
> interrupts once sparse_irq_lock is dropped, but the bulk is achieved
> with the separate CPU mask.
>
> No?

Thanks for the great suggestions. I will certainly look into that.

We actually have a cpu_active_mask that is cleared early in
sched_cpu_deactivate(). In the CPUHP_AP_SCHED_WAIT_EMPTY state, the CPU
will still have its online bit set, but its active bit will be cleared. Or
we could add another cpumask to indicate CPUs that have reached
CPUHP_AP_SCHED_WAIT_EMPTY or below, if necessary.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 18/23] cpu/hotplug: Add a new cpuhp_offline_cb() API
  2026-04-21 17:29     ` Waiman Long
@ 2026-04-21 18:43       ` Thomas Gleixner
  0 siblings, 0 replies; 56+ messages in thread
From: Thomas Gleixner @ 2026-04-21 18:43 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan

On Tue, Apr 21 2026 at 13:29, Waiman Long wrote:
> On 4/21/26 12:17 PM, Thomas Gleixner wrote:
> Thanks for the great suggestions. I will certainly look into that.
>
> We actually have a cpu_active_mask that will be cleared early in 
> sched_cpu_deactivate(). In the CPUHP_AP_SCHED_WAIT_EMPTY state, the CPU 
> will still have online bit set but the active bit will be cleared. Or we 
> could add another cpumask that can be used to indicate CPUs that have 
> reached CPUHP_AP_SCHED_WAIT_EMPTY or below if necessary.

Right. Active mask is immediately cleared when a CPU goes down so that
the scheduler does not enqueue new tasks on it. But you can't use it for
interrupts because on CPU up the mask must be up to date when
irq_affinity_online_cpu() is invoked. The tick has the same constraints.

So for interrupts this should be handled in CPUHP_AP_IRQ_AFFINITY_ONLINE
both in the existing up and the new down callback. That can be an
interrupt-core-local CPU mask which is updated in the callbacks with
sparse_irq_lock held.

Same for the tick handover magic.
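The suggested core-local mask could look roughly like this (a userspace
sketch; the function names mirror the CPUHP_AP_IRQ_AFFINITY_ONLINE up/down
callbacks, and the pthread mutex merely stands in for sparse_irq_lock):

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Toy mask of CPUs eligible for managed-interrupt affinity. */
static uint64_t irq_affinity_mask;
static pthread_mutex_t sparse_irq_lock = PTHREAD_MUTEX_INITIALIZER;

static int irq_affinity_online_cpu(unsigned int cpu)
{
	pthread_mutex_lock(&sparse_irq_lock);
	irq_affinity_mask |= 1ull << cpu;
	/* ...reassign managed interrupts toward 'cpu' here... */
	pthread_mutex_unlock(&sparse_irq_lock);
	return 0;
}

static int irq_affinity_offline_cpu(unsigned int cpu)
{
	pthread_mutex_lock(&sparse_irq_lock);
	irq_affinity_mask &= ~(1ull << cpu);
	/* ...migrate managed interrupts away from 'cpu' here... */
	pthread_mutex_unlock(&sparse_irq_lock);
	return 0;
}
```

Because the mask is private to the interrupt core and only mutated under
the lock, it stays up to date when irq_affinity_online_cpu() runs, without
touching cpu_online_mask or cpu_active_mask.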

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 06/23] rcu/nocbs: Allow runtime changes in RCU NOCBS cpumask
  2026-04-21  3:03 ` [PATCH 06/23] rcu/nocbs: Allow runtime changes in RCU NOCBS cpumask Waiman Long
@ 2026-04-22  3:08   ` sashiko-bot
  0 siblings, 0 replies; 56+ messages in thread
From: sashiko-bot @ 2026-04-22  3:08 UTC (permalink / raw)
  To: Waiman Long; +Cc: mkoutny, tj, hannes, linux-hwmon

Sashiko AI review found 1 potential issue(s):
- [High] Dynamic RCU NOCB update attempts to de-offload online `rcu_nocbs=` CPUs, triggering a WARN_ON_ONCE and leaving isolation state inconsistent.
--

commit a58a92fdc96e43de871fd44a78e4e6da373179d5
Author: Waiman Long <longman@redhat.com>

rcu/nocbs: Allow runtime changes in RCU NOCBS cpumask

This commit enables dynamic runtime updates to the RCU NO-CB CPU offloading
state when isolated CPUs are added or removed. It introduces a new helper,
rcu_nocb_update_cpus(), which is called directly from housekeeping_update()
to synchronize the offloaded CPUs with the HK_TYPE_KERNEL_NOISE cpumask.

> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index 2d06dcb61f379..b2daba1e5cb98 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
[ ... ]
> +void rcu_nocb_update_cpus(struct cpumask *cpumask)
> +{
> +	int cpu, ret;
> +
> +	if (!rcu_state.nocb_is_setup) {
> +		pr_warn_once("Dynamic RCU NOCB cannot be enabled without nohz_full/rcu_nocbs kernel boot parameter!\n");
> +		return;
> +	}
> +
> +	for_each_cpu_andnot(cpu, cpumask, rcu_nocb_mask) {
> +		ret = rcu_nocb_cpu_offload(cpu);
> +		if (WARN_ON_ONCE(ret))
> +			return;
> +	}
> +
> +	for_each_cpu_andnot(cpu, rcu_nocb_mask, cpumask) {
> +		ret = rcu_nocb_cpu_deoffload(cpu);

What happens here if a system booted with explicit rcu_nocbs= CPUs, and
later a user dynamically updates an isolated cpuset?

It looks like this loop assumes the active rcu_nocb_mask should identically
match the new isol_mask. If the boot-time rcu_nocbs= CPUs are not included
in the new isolation mask, this loop will attempt to de-offload them.

Because the cpuset subsystem is unaware of the independent rcu_nocbs= mask,
wouldn't those original CPUs likely remain online?

> +		if (WARN_ON_ONCE(ret))
> +			return;

If rcu_nocb_cpu_deoffload() is called on an online CPU, it returns -EINVAL.
Could this unconditionally trigger this WARN_ON_ONCE and abort the rest of
the update loop?

If the loop aborts early, since housekeeping_update() already committed the
new mask to housekeeping.cpumasks[], does this leave the system in a
desynchronized state where the scheduler assumes CPUs are isolated but
RCU callback offloading is not properly updated?

> +	}
> +}
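One possible way to avoid de-offloading the boot-time `rcu_nocbs=` CPUs —
purely a sketch of the idea, not code from the patch — is to keep a saved
copy of the boot-time mask and fold it into the target. A toy bitmask
model (rcu_nocb_update_cpus_model() and boot_nocb_mask are invented names):

```c
#include <assert.h>
#include <stdint.h>

/* Toy masks: bit i set == CPU i is NOCB-offloaded. */
static uint64_t rcu_nocb_mask;
static uint64_t boot_nocb_mask;	/* saved copy of the boot-time rcu_nocbs= mask */

/*
 * Synchronize rcu_nocb_mask with the requested isolation mask, but never
 * de-offload a CPU that was already offloaded at boot time.
 */
static void rcu_nocb_update_cpus_model(uint64_t isol_mask)
{
	uint64_t target = isol_mask | boot_nocb_mask;

	rcu_nocb_mask |= target & ~rcu_nocb_mask;	/* offload newly isolated CPUs */
	rcu_nocb_mask &= ~(rcu_nocb_mask & ~target);	/* de-offload only the rest */
}
```

With this, shrinking the isolated cpuset never tries to de-offload an
online boot-time NOCB CPU, so the -EINVAL/WARN_ON_ONCE path above is not
reachable from that direction.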

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260421030351.281436-1-longman@redhat.com?part=6

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 04/23] tick/nohz: Allow runtime changes in full dynticks CPUs
  2026-04-21  3:03 ` [PATCH 04/23] tick/nohz: Allow runtime changes in full dynticks CPUs Waiman Long
  2026-04-21  8:50   ` Thomas Gleixner
@ 2026-04-22  3:08   ` sashiko-bot
  1 sibling, 0 replies; 56+ messages in thread
From: sashiko-bot @ 2026-04-22  3:08 UTC (permalink / raw)
  To: Waiman Long; +Cc: hannes, linux-hwmon, tj, mkoutny

Sashiko AI review found 4 potential issue(s):
- [High] Removing the TICK_DO_TIMER_NONE check in can_stop_idle_tick() allows all housekeeping CPUs to stop their ticks, permanently freezing timekeeping.
- [High] Modifying context_tracking.active for online CPUs bypasses RCU context tracking, causing Use-After-Free vulnerabilities.
- [High] Late context tracking initialization fails to propagate TIF_NOHZ to existing tasks and triggers an unconditional WARN_ON_ONCE.
- [Medium] Incomplete implementation: failed to remove the __init attribute from tick_nohz_init().
--

commit 4a1b2d82a8726c69f585b6c42f365c673f890897
Author: Waiman Long <longman@redhat.com>

tick/nohz: Allow runtime changes in full dynticks CPUs

This commit introduces tick_nohz_full_update_cpus() to update
tick_nohz_full_mask at runtime when isolated cpusets are changed.
It modifies context tracking to allow enabling and disabling user
context tracking dynamically.

>     We also need to enable CPU context tracking for those CPUs that
>     are in tick_nohz_full_mask. So remove __init from tick_nohz_init()
>     and ct_cpu_track_user() so that they be called later when an isolated
>     cpuset partition is being created.

This isn't a bug, but the commit message says __init is removed from
tick_nohz_init(), yet that change doesn't appear in the patch diff.

> diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
> index 925999de1a28c..394e432630a39 100644
> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
[ ... ]
> @@ -674,9 +674,9 @@ void user_exit_callable(void)
>  }
>  NOKPROBE_SYMBOL(user_exit_callable);
>  
> -void __init ct_cpu_track_user(int cpu)
> +void ct_cpu_track_user(int cpu)
>  {
> -	static __initdata bool initialized = false;
> +	static bool initialized;

If this function runs for the first time during a runtime update, will it
trigger the WARN_ON_ONCE(!tasklist_empty()) located further down in this
function?

Also, since only the init_task receives the TIF_NOHZ flag during
this initialization block, will existing tasks that migrate to the newly
isolated CPUs be missing the flag and bypass context tracking checks?

> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 081f674a65882..67a747207f735 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
[ ... ]
> @@ -629,6 +626,36 @@ void __init tick_nohz_full_setup(cpumask_var_t cpumask)
>  	tick_nohz_full_running = true;
>  }
>  
> +/* Get the new set of run-time nohz CPU list & update accordingly */
> +void tick_nohz_full_update_cpus(struct cpumask *cpumask)
> +{
[ ... ]
> +	for_each_cpu_andnot(cpu, tick_nohz_full_mask, cpumask) {
> +		WARN_ON_ONCE(cpu_online(cpu));
> +		ct_cpu_untrack_user(cpu);
> +		cpumask_clear_cpu(cpu, tick_nohz_full_mask);
> +	}
> +}

Could this cause Use-After-Free issues if a CPU is online and executing in
userspace when its context tracking is disabled?

The WARN_ON_ONCE warns if the CPU is online, but doesn't prevent
ct_cpu_untrack_user() from running.

If the active flag is cleared while the CPU is in an RCU Extended Quiescent
State, won't it enter the kernel without calling rcu_user_exit(), causing
RCU to incorrectly assume the CPU is still in a quiescent state?

[ ... ]
> @@ -1238,10 +1265,6 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
>  		 */
>  		if (tick_cpu == cpu)
>  			return false;
> -
> -		/* Should not happen for nohz-full */
> -		if (WARN_ON_ONCE(tick_cpu == TICK_DO_TIMER_NONE))
> -			return false;
>  	}
>  
>  	return true;

Is it possible for timekeeping to permanently freeze here?

If tick_do_timer_cpu becomes TICK_DO_TIMER_NONE as mentioned in the commit
message, and all non-isolated housekeeping CPUs enter idle, won't they all
evaluate tick_cpu == cpu as false?

Without this check, it seems they will all proceed to stop their ticks,
leaving no CPU to take over the timekeeping duty.
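The concern can be demonstrated with a toy model of the removed branch
(constants and the keep_none_check switch are invented for the model; this
is not the real can_stop_idle_tick()):

```c
#include <assert.h>
#include <stdbool.h>

#define TICK_DO_TIMER_NONE	(-1)

/* Toy state: no CPU currently owns the timekeeping duty. */
static int tick_do_timer_cpu = TICK_DO_TIMER_NONE;

/* Model of the nohz-full branch of can_stop_idle_tick(). */
static bool can_stop_idle_tick_model(int cpu, bool keep_none_check)
{
	if (tick_do_timer_cpu == cpu)
		return false;		/* this CPU holds the duty: keep ticking */
	if (keep_none_check && tick_do_timer_cpu == TICK_DO_TIMER_NONE)
		return false;		/* nobody holds it: keep ticking */
	return true;
}
```

With the check kept, an idle CPU refuses to stop its tick while the duty
is unassigned; with it removed, every idle housekeeping CPU is allowed to
stop, which is exactly the freeze scenario described above.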

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260421030351.281436-1-longman@redhat.com?part=4

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/23] watchdog: Sync up with runtime change of isolated CPUs
  2026-04-21  3:03 ` [PATCH 07/23] watchdog: Sync up with runtime change of isolated CPUs Waiman Long
@ 2026-04-22  3:08   ` sashiko-bot
  0 siblings, 0 replies; 56+ messages in thread
From: sashiko-bot @ 2026-04-22  3:08 UTC (permalink / raw)
  To: Waiman Long; +Cc: mkoutny, linux-hwmon, tj, hannes

Sashiko AI review found 1 potential issue(s):
- [High] Silent overwrite and clobbering of userspace `watchdog_cpumask` sysctl configuration.
--

commit c51e1c592377490b0da49dd889417304c34badc2
Author: Waiman Long <longman@redhat.com>

watchdog: Sync up with runtime change of isolated CPUs

This commit updates the watchdog to exclude isolated CPUs at runtime by
adding a new watchdog_cpumask_update() helper. It modifies watchdog_cpumask
and watchdog_allowed_mask when the HK_TYPE_KERNEL_NOISE cpumask is updated.

> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 7d675781bc917..f35374abf457c 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
[ ... ]
> +/**
> + * watchdog_cpumask_update - update watchdog_cpumask & watchdog_allowed_mask
> + * @isol_mask: cpumask of isolated CPUs
> + *
> + * Update watchdog_cpumask and watchdog_allowed_mask to be inverse of the
> + * given isolated cpumask to disable watchdog activities on isolated CPUs.
> + * It should be called with the affected CPUs in offline state which will be
> + * brought up online later.
> + *
> + * Any changes made in watchdog_cpumask by users via the sysctl parameter will
> + * be overridden. However, proc_watchdog_update() isn't called. So change will
> + * only happens on CPUs that will brought up later on to minimize changes to
> + * the existing watchdog configuration.
> + */
> +void watchdog_cpumask_update(struct cpumask *isol_mask)
> +{
> +	mutex_lock(&watchdog_mutex);
> +	cpumask_andnot(&watchdog_cpumask, cpu_possible_mask, isol_mask);

Does this unintentionally discard the existing sysctl configuration for
kernel.watchdog_cpumask?

By using cpu_possible_mask as the base, it looks like this unconditionally
overwrites the entire watchdog_cpumask with all possible CPUs except the
newly isolated ones. 

If an administrator had explicitly disabled watchdogs on CPUs 1-6 using the
sysctl interface, and later dynamically isolates CPU 7, would this operation
silently clobber the user API contract and re-enable watchdogs on CPUs 1-6
when they are next onlined or reconfigured?

> +#ifdef CONFIG_SOFTLOCKUP_DETECTOR
> +	cpumask_copy(&watchdog_allowed_mask, &watchdog_cpumask);
> +#endif
> +	mutex_unlock(&watchdog_mutex);
> +}
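One way to preserve the administrator's sysctl clears would be an
incremental update rather than rebuilding from cpu_possible_mask — a toy
bitmask sketch, with watchdog_cpumask_update_model() and prev_isol
invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Toy watchdog mask for CPUs 0-7; admin clears survive isolation changes. */
static uint64_t watchdog_cpumask = 0xFFu;
static uint64_t prev_isol;	/* saved copy of the previous isolation mask */

/*
 * Clear only the newly isolated CPUs and restore only CPUs that leave
 * isolation, so bits the administrator cleared via sysctl stay cleared.
 */
static void watchdog_cpumask_update_model(uint64_t isol_mask)
{
	watchdog_cpumask &= ~isol_mask;			/* drop newly isolated CPUs */
	watchdog_cpumask |= prev_isol & ~isol_mask;	/* restore de-isolated CPUs */
	prev_isol = isol_mask;
}
```

In the scenario from the review — watchdogs disabled on CPUs 1-6 via
sysctl, then CPU 7 isolated and later de-isolated — CPUs 1-6 stay
disabled throughout.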

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260421030351.281436-1-longman@redhat.com?part=7

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 02/23] sched/isolation: Enhance housekeeping_update() to support updating more than one HK cpumask
  2026-04-21  3:03 ` [PATCH 02/23] sched/isolation: Enhance housekeeping_update() to support updating more than one HK cpumask Waiman Long
@ 2026-04-22  3:08   ` sashiko-bot
  2026-04-22  6:39   ` Chen Ridong
  1 sibling, 0 replies; 56+ messages in thread
From: sashiko-bot @ 2026-04-22  3:08 UTC (permalink / raw)
  To: Waiman Long; +Cc: tj, mkoutny, linux-hwmon, hannes

Sashiko AI review found 2 potential issue(s):
- [Low] The dummy static inline function for `housekeeping_update()` for `!CONFIG_CPU_ISOLATION` kernels was not updated with the new `flags` parameter.
- [Low] The `hk_types` mapping array is missing the `const` qualifier.
--

commit a7583f48d85bde2543fa30edd95c2d2d24ceae16
Author: Waiman Long <longman@redhat.com>

sched/isolation: Enhance housekeeping_update() to support updating more than one HK cpumask

This commit extends housekeeping_update to support updating multiple
housekeeping cpumasks simultaneously. It adds a new flags parameter to
pass the target housekeeping types and ensures the new mask is a subset
of the boot cpumask.

> diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> index d1707f121e203..a17f16e0156e9 100644
> --- a/include/linux/sched/isolation.h
> +++ b/include/linux/sched/isolation.h
> @@ -51,7 +51,7 @@ extern const struct cpumask *housekeeping_cpumask(enum hk_type type);
>  extern bool housekeeping_enabled(enum hk_type type);
>  extern void housekeeping_affine(struct task_struct *t, enum hk_type type);
>  extern bool housekeeping_test_cpu(int cpu, enum hk_type type);
> -extern int housekeeping_update(struct cpumask *isol_mask);
> +extern int housekeeping_update(struct cpumask *isol_mask, unsigned long flags);
>  extern void __init housekeeping_init(void);
>  
>  #else

While the extern declaration in the CONFIG_CPU_ISOLATION block is updated,
is the dummy static inline fallback function in the #else block lower down
missing the new flags parameter? 

It looks like it might still be defined as taking only the isol_mask
parameter, which would cause an API mismatch for builds without
CONFIG_CPU_ISOLATION.
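For illustration, a minimal sketch of what the matching fallback stub would
need to look like (the body and return value here are assumptions for this
sketch, not the patch's actual code — only the two-argument signature matters):

```c
#include <assert.h>

struct cpumask;  /* opaque here, as in the kernel header */

/*
 * Hedged sketch: the #else stub must mirror the updated two-argument
 * signature, otherwise callers passing the new flags argument fail to
 * compile on !CONFIG_CPU_ISOLATION builds. Returning 0 is an assumption.
 */
static inline int housekeeping_update(struct cpumask *isol_mask,
                                      unsigned long flags)
{
    (void)isol_mask;
    (void)flags;
    return 0;
}
```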

[ ... ]

> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 258601ad20113..c24215e05b7b2 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
[ ... ]
> +/* HK type processing table */
> +static struct {
> +	int type;
> +	int boot_type;
> +} hk_types[] = {
> +	{ HK_TYPE_DOMAIN,       HK_TYPE_DOMAIN_BOOT	  },
> +	{ HK_TYPE_MANAGED_IRQ,  HK_TYPE_MANAGED_IRQ_BOOT  },
> +	{ HK_TYPE_KERNEL_NOISE, HK_TYPE_KERNEL_NOISE_BOOT }
> +};

This isn't a bug, but could this static array be marked as const so it is
placed in the read-only section instead of the mutable data section?
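As a sketch of the suggested change (the enum values below are stand-ins
for illustration; the only substantive difference is the added const):

```c
#include <assert.h>

/* Stand-in values; the real constants come from enum hk_type. */
enum {
    HK_TYPE_DOMAIN, HK_TYPE_MANAGED_IRQ, HK_TYPE_KERNEL_NOISE,
    HK_TYPE_DOMAIN_BOOT, HK_TYPE_MANAGED_IRQ_BOOT, HK_TYPE_KERNEL_NOISE_BOOT,
};

/* Adding const lets the compiler place the table in .rodata. */
static const struct {
    int type;
    int boot_type;
} hk_types[] = {
    { HK_TYPE_DOMAIN,       HK_TYPE_DOMAIN_BOOT       },
    { HK_TYPE_MANAGED_IRQ,  HK_TYPE_MANAGED_IRQ_BOOT  },
    { HK_TYPE_KERNEL_NOISE, HK_TYPE_KERNEL_NOISE_BOOT },
};
```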

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260421030351.281436-1-longman@redhat.com?part=2

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 03/23] tick/nohz: Make nohz_full parameter optional
  2026-04-21  3:03 ` [PATCH 03/23] tick/nohz: Make nohz_full parameter optional Waiman Long
  2026-04-21  8:32   ` Thomas Gleixner
@ 2026-04-22  3:08   ` sashiko-bot
  1 sibling, 0 replies; 56+ messages in thread
From: sashiko-bot @ 2026-04-22  3:08 UTC (permalink / raw)
  To: Waiman Long; +Cc: linux-hwmon, mkoutny, tj, hannes

Sashiko AI review found 1 potential issue(s):
- [High] Booting with `nohz_full` (no argument) silently disables `CONFIG_RCU_NOCB_CPU_DEFAULT_ALL`.
--

commit b2af6c59a7f3101f93296e77bcb1e7af45562ea8
Author: Waiman Long <longman@redhat.com>

tick/nohz: Make nohz_full parameter optional

This commit allows the nohz_full boot option to be specified without
any parameters, forcing it to be enabled without any CPUs initially in
the tick_nohz_full_mask.

[ ... ]

> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index b3337c7231ccb..2d06dcb61f379 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -1267,7 +1267,7 @@ void __init rcu_init_nohz(void)
>  	struct shrinker * __maybe_unused lazy_rcu_shrinker;
>  
>  #if defined(CONFIG_NO_HZ_FULL)
> -	if (tick_nohz_full_running && !cpumask_empty(tick_nohz_full_mask))
> +	if (tick_nohz_full_running)
>  		cpumask = tick_nohz_full_mask;
>  #endif

If the kernel is booted with nohz_full (no arguments), tick_nohz_full_running
will be true but tick_nohz_full_mask will be empty. This causes cpumask to
be assigned a non-NULL pointer to an empty mask.

Will this inadvertently disable CONFIG_RCU_NOCB_CPU_DEFAULT_ALL?

Looking at the code immediately following this change in rcu_init_nohz():

	if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_DEFAULT_ALL) &&
	    !rcu_state.nocb_is_setup && !cpumask)
		cpumask = cpu_possible_mask;

Since cpumask is no longer NULL, this fallback check evaluates to false.
The subsequent code will then OR with the empty tick_nohz_full_mask, resulting
in rcu_nocb_mask remaining empty and disabling RCU callback offloading for
all CPUs at boot.
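The control flow can be modeled outside the kernel. This toy (all names
invented for the demo) shows how a non-NULL pointer to an empty mask starves
the DEFAULT_ALL fallback:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy model of the selection order in rcu_init_nohz() (not kernel code):
 * once cpumask is non-NULL, the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL fallback
 * below is skipped, even when the mask it points at is empty.
 */
static const char *pick_nocb_source(bool nohz_running, bool nohz_mask_empty,
                                    bool default_all)
{
    const char *cpumask = NULL;

    if (nohz_running)  /* patched check: emptiness no longer tested */
        cpumask = nohz_mask_empty ? "empty nohz mask" : "nohz mask";

    if (default_all && cpumask == NULL)  /* fallback now unreachable */
        cpumask = "cpu_possible_mask";

    return cpumask ? cpumask : "none";
}
```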

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260421030351.281436-1-longman@redhat.com?part=3

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 08/23] arm64: topology: Use RCU to protect access to HK_TYPE_TICK cpumask
  2026-04-21  3:03 ` [PATCH 08/23] arm64: topology: Use RCU to protect access to HK_TYPE_TICK cpumask Waiman Long
@ 2026-04-22  3:08   ` sashiko-bot
  2026-04-22  9:34   ` Chen Ridong
  1 sibling, 0 replies; 56+ messages in thread
From: sashiko-bot @ 2026-04-22  3:08 UTC (permalink / raw)
  To: Waiman Long; +Cc: hannes, tj, mkoutny, linux-hwmon

Sashiko AI review found 1 potential issue(s):
- [High] The `hk_cpu` variable is evaluated only once outside the `while (1)` loop, causing it to use stale housekeeping status when iterating over fallback CPUs.
--

commit ad2cf9abc3cffbdb84c73b305a44e5027c5f5e64
Author: Waiman Long <longman@redhat.com>

arm64: topology: Use RCU to protect access to HK_TYPE_TICK cpumask

This commit uses RCU to protect access to the HK_TYPE_TICK cpumask to
prevent it from going away in the middle of the operation.

> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index b32f13358fbb1..48f1508016895 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
[ ... ]
> @@ -187,11 +188,16 @@ int arch_freq_get_on_cpu(int cpu)
>  	unsigned int start_cpu = cpu;
>  	unsigned long last_update;
>  	unsigned int freq = 0;
> +	bool hk_cpu;
>  	u64 scale;
>  
>  	if (!amu_fie_cpu_supported(cpu) || !arch_scale_freq_ref(cpu))
>  		return -EOPNOTSUPP;
>  
> +	scoped_guard(rcu) {
> +		hk_cpu = housekeeping_cpu(cpu, HK_TYPE_TICK);
> +	}
> +
>  	while (1) {
>  
>  		amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
> @@ -204,16 +210,21 @@ int arch_freq_get_on_cpu(int cpu)
>  		 * (and thus freq scale), if available, for given policy: this boils
>  		 * down to identifying an active cpu within the same freq domain, if any.
>  		 */
> -		if (!housekeeping_cpu(cpu, HK_TYPE_TICK) ||
> +		if (!hk_cpu ||
>  		    time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS))) {
>  			struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);

Since hk_cpu is evaluated outside the while loop, does it use the
housekeeping status of the initial start_cpu for all iterations?

If the start_cpu is tickless (hk_cpu == false), it appears !hk_cpu will
remain permanently true for all loop iterations. This would cause the loop
to unconditionally reject every fallback cpu and eventually return -EAGAIN.

Conversely, if start_cpu is a housekeeping CPU with a stale sample, it
might improperly accept a tickless fallback cpu since hk_cpu remains true.

Should the hk_cpu evaluation be moved inside the while loop so it is
re-evaluated when the cpu variable changes?
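The effect is easy to model with a toy loop (names invented;
is_housekeeping() plays the role of housekeeping_cpu()):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy: CPUs 0-1 are tickless, CPUs 2+ are housekeeping. */
static bool is_housekeeping(int cpu) { return cpu >= 2; }

/* Stale variant, as in the patch: evaluated once before the loop. */
static int pick_cpu_stale(int start_cpu, int nr_cpus)
{
    bool hk_cpu = is_housekeeping(start_cpu);

    for (int cpu = start_cpu; cpu < nr_cpus; cpu++)
        if (hk_cpu)
            return cpu;
    return -1;  /* analogue of -EAGAIN: every fallback rejected */
}

/* Suggested fix: re-evaluate for each candidate cpu. */
static int pick_cpu_fresh(int start_cpu, int nr_cpus)
{
    for (int cpu = start_cpu; cpu < nr_cpus; cpu++)
        if (is_housekeeping(cpu))  /* refreshed per iteration */
            return cpu;
    return -1;
}
```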

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260421030351.281436-1-longman@redhat.com?part=8

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 11/23] hrtimer: Use RCU to protect access of HK_TYPE_TIMER cpumask
  2026-04-21  3:03 ` [PATCH 11/23] hrtimer: " Waiman Long
  2026-04-21  8:59   ` Thomas Gleixner
@ 2026-04-22  3:09   ` sashiko-bot
  1 sibling, 0 replies; 56+ messages in thread
From: sashiko-bot @ 2026-04-22  3:09 UTC (permalink / raw)
  To: Waiman Long; +Cc: linux-hwmon, mkoutny, hannes, tj

Sashiko AI review found 1 potential issue(s):
- [High] Missing bounds check on the CPU index returned by cpumask_any_and(), leading to potential out-of-bounds array access in per_cpu().
--

commit 9c505fb411d64c62ba8b88daa1385610377293b4
Author: Waiman Long <longman@redhat.com>

hrtimer: Use RCU to protect access of HK_TYPE_TIMER cpumask

This commit updates the housekeeping cpumask access in get_target_base() to
be protected by an RCU read lock. This change prepares for the HK_TYPE_TIMER
cpumask becoming changeable at runtime.

> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 5bd6efe598f0f..bc3652f928b9d 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -242,8 +242,10 @@ static bool hrtimer_suitable_target(struct hrtimer *timer, struct hrtimer_clock_
>  static inline struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base, bool pinned)
>  {
>  	if (!hrtimer_base_is_online(base)) {
> -		int cpu = cpumask_any_and(cpu_online_mask, housekeeping_cpumask(HK_TYPE_TIMER));
> +		int cpu;
>  
> +		guard(rcu)();
> +		cpu = cpumask_any_and(cpu_online_mask, housekeeping_cpumask(HK_TYPE_TIMER));

As dynamic runtime updates of the housekeeping masks are being enabled, is it
possible for the intersection of cpu_online_mask and
housekeeping_cpumask(HK_TYPE_TIMER) to be empty?

If all housekeeping CPUs are offlined, cpumask_any_and() could return a value
greater than or equal to nr_cpu_ids.

>  		return &per_cpu(hrtimer_bases, cpu);

If cpu is greater than or equal to nr_cpu_ids, could this lead to an
out-of-bounds array access in per_cpu()?

Would it make sense to add a bounds check here to fall back safely if an
online housekeeping CPU is not found?

>  	}
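A self-contained sketch of the suggested guard, with bitmask stand-ins for
cpumasks (the fall-back-to-local-CPU choice is an assumption for this demo):

```c
#include <assert.h>

#define NR_CPU_IDS 8

/*
 * Toy cpumask_any_and(): first CPU set in both masks, or NR_CPU_IDS when
 * the intersection is empty — mirroring the kernel convention that a
 * value >= nr_cpu_ids means "no CPU found".
 */
static int any_and(unsigned int online, unsigned int housekeeping)
{
    unsigned int both = online & housekeeping;

    for (int cpu = 0; cpu < NR_CPU_IDS; cpu++)
        if (both & (1u << cpu))
            return cpu;
    return NR_CPU_IDS;
}

/* Bounds-checked target selection: fall back to the local CPU rather
 * than indexing per_cpu() with an out-of-range value. */
static int pick_target(unsigned int online, unsigned int housekeeping,
                       int local_cpu)
{
    int cpu = any_and(online, housekeeping);

    return (cpu >= NR_CPU_IDS) ? local_cpu : cpu;
}
```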

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260421030351.281436-1-longman@redhat.com?part=11

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 13/23] sched/core: Use RCU to protect access of HK_TYPE_KERNEL_NOISE cpumask
  2026-04-21  3:03 ` [PATCH 13/23] sched/core: Use RCU to protect access of HK_TYPE_KERNEL_NOISE cpumask Waiman Long
@ 2026-04-22  3:09   ` sashiko-bot
  0 siblings, 0 replies; 56+ messages in thread
From: sashiko-bot @ 2026-04-22  3:09 UTC (permalink / raw)
  To: Waiman Long; +Cc: mkoutny, tj, linux-hwmon, hannes

Sashiko AI review found 1 potential issue(s):
- [High] The commit message incorrectly claims that `housekeeping_cpumask()` will warn if used in an invalid context, masking potential RCU safety violations elsewhere in the kernel. Specifically, `housekeeping_dereference_check(type)` returns `true` for `HK_TYPE_KERNEL_NOISE`, causing `rcu_dereference_all_check()` to bypass the `rcu_read_lock_held()` validation. This false assumption of a safety net led the author to miss unprotected accesses, such as the `housekeeping_show` sysfs handler in `drivers/base/cpu.c` which does not hold `rcu_read_lock()`. Because `housekeeping_update()` replaces and frees the old mask via `kfree()` after `synchronize_rcu()`, concurrent execution of `housekeeping_show()` results in a Use-After-Free. While the lockdep omission is patched later in the series by commit "sched/isolation: Extend housekeeping_dereference_check() to cover changes in nohz_full or manged_irqs cpumasks", the underlying Use-After-Free vulnerability in `drivers/base/cpu.c` remains unfixed throughout the entire series.
--

commit 7a774b36f9d9b430bf7171534b53f6145556f483
Author: Waiman Long <longman@redhat.com>

sched/core: Use RCU to protect access of HK_TYPE_KERNEL_NOISE cpumask

This commit adds RCU protection to the HK_TYPE_KERNEL_NOISE cpumask in
get_nohz_timer_target(), as the mask can now be changed at runtime. It
also notes that housekeeping_cpumask() will warn if used in an invalid
context.

> As HK_TYPE_KERNEL_NOISE is going to be changeable at run time, use
> RCU to protect access to the cpumask when needed. Some access of the
> HK_TYPE_KERNEL_NOISE cpumask is done inside the tick code that has
> interrupt disabled which is a rcu_read_lock() critical section. Anyway,
> housekeeping_cpumask() will warn if they are used in invalid context.

Is this description accurate? Looking at housekeeping_dereference_check(),
it returns true for HK_TYPE_KERNEL_NOISE, which causes
rcu_dereference_all_check() to bypass the rcu_read_lock_held() validation.
This means lockdep will not warn about missing RCU read locks for this
specific mask.

Because the mask can now be replaced and freed dynamically after
synchronize_rcu(), does this introduce a use-after-free in other readers
that don't hold the RCU read lock?

For example, housekeeping_show() in drivers/base/cpu.c reads this mask to
print it to sysfs:

drivers/base/cpu.c:housekeeping_show() {
    ...
    return sysfs_emit(buf, "%*pbl\n",
                      cpumask_pr_args(housekeeping_cpumask(type)));
}

Since housekeeping_show() doesn't hold rcu_read_lock(), could a concurrent
update of the cpumask cause cpumask_pr_args() to access freed memory?

While the lockdep omission is patched later in this series by the commit
"sched/isolation: Extend housekeeping_dereference_check() to cover changes
in nohz_full or manged_irqs cpumasks", the underlying use-after-free
regression in drivers/base/cpu.c appears to remain unfixed throughout the
series.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260421030351.281436-1-longman@redhat.com?part=13

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 14/23] hwmon/coretemp: Use RCU to protect access of HK_TYPE_MISC cpumask
  2026-04-21  3:03 ` [PATCH 14/23] hwmon/coretemp: Use RCU to protect access of HK_TYPE_MISC cpumask Waiman Long
@ 2026-04-22  3:09   ` sashiko-bot
  0 siblings, 0 replies; 56+ messages in thread
From: sashiko-bot @ 2026-04-22  3:09 UTC (permalink / raw)
  To: Waiman Long; +Cc: linux-hwmon, tj, hannes, mkoutny

Sashiko AI review found 1 potential issue(s):
- [Medium] Mixing goto-based cleanup labels with scope-based cleanup helpers (`scoped_guard`) in the same function.
--

commit 83c63b96668998046d82d5fd34944cc02b7b7876
Author: Waiman Long <longman@redhat.com>

hwmon/coretemp: Use RCU to protect access of HK_TYPE_MISC cpumask

This patch wraps the housekeeping_cpu() check in a scoped_guard for RCU
since the HK_TYPE_MISC cpumask can change at runtime.

> diff --git a/drivers/hwmon/coretemp.c b/drivers/hwmon/coretemp.c
> --- a/drivers/hwmon/coretemp.c
> +++ b/drivers/hwmon/coretemp.c
> @@ -563,8 +563,10 @@ static int create_core_data(struct platform_device *pdev, unsigned int cpu,
>  	u32 eax, edx;
>  	int err;
>  
> -	if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
> -		return 0;
> +	scoped_guard(rcu) {
> +		if (!housekeeping_cpu(cpu, HK_TYPE_MISC))
> +			return 0;
> +	}

Is it okay to use scoped_guard in a function that also relies on goto
labels for error handling?

Looking at include/linux/cleanup.h, the guidelines state that usage of
goto and cleanup helpers should never be mixed in the same function. We
should either convert all resources to scope-based cleanup or convert
none of them.

[ ... ]

>  	/* Test if we can access the status register */
>  	err = rdmsr_safe_on_cpu(cpu, tdata->status_reg, &eax, &edx);
>  	if (err)
>  		goto err;

[ ... ]

>  err:
>  	destroy_temp_data(pdata, tdata);
>  	return err;
>  }

Since create_core_data() still relies on the goto err label to clean up
tdata, would it be better to use explicit rcu_read_lock() and
rcu_read_unlock() around the housekeeping_cpu() check, or alternatively
convert tdata to use a cleanup helper?
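The explicit-lock alternative compiles and runs with trivial stubs
(everything below the stubs is the suggested shape, not the driver's
actual code):

```c
#include <assert.h>
#include <stdbool.h>

/* Trivial stubs so the pattern is runnable outside the kernel. */
enum hk_type { HK_TYPE_MISC };
static int rcu_depth;
static void rcu_read_lock(void)   { rcu_depth++; }
static void rcu_read_unlock(void) { rcu_depth--; }
static bool housekeeping_cpu(int cpu, enum hk_type type)
{
    (void)type;
    return cpu != 0;  /* pretend CPU 0 is not a housekeeping CPU */
}

/*
 * Suggested shape: an explicitly balanced RCU section around the check,
 * leaving the rest of the function free to keep its goto-based cleanup.
 */
static int create_core_data_check(int cpu)
{
    bool hk;

    rcu_read_lock();
    hk = housekeeping_cpu(cpu, HK_TYPE_MISC);
    rcu_read_unlock();

    if (!hk)
        return 0;  /* skip, as in the original early return */
    return 1;      /* proceed with the goto-cleanup body */
}
```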

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260421030351.281436-1-longman@redhat.com?part=14

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 15/23] Drivers: hv: Use RCU to protect access of HK_TYPE_MANAGED_IRQ cpumask
  2026-04-21  3:03 ` [PATCH 15/23] Drivers: hv: Use RCU to protect access of HK_TYPE_MANAGED_IRQ cpumask Waiman Long
@ 2026-04-22  3:09   ` sashiko-bot
  0 siblings, 0 replies; 56+ messages in thread
From: sashiko-bot @ 2026-04-22  3:09 UTC (permalink / raw)
  To: Waiman Long; +Cc: tj, mkoutny, linux-hwmon, hannes

Sashiko AI review found 3 potential issue(s):
- [High] Infinite loop causing RCU stall when a NUMA node lacks housekeeping CPUs.
- [Medium] Mixing `goto`-based cleanup with scope-based `guard()` violates subsystem guidelines.
- [Low] Unnecessary unconditional memory allocation for non-performance channels.
--

commit bc4398753271c194097795d20b6095f594c74b4c
Author: Waiman Long <longman@redhat.com>

Drivers: hv: Use RCU to protect access of HK_TYPE_MANAGED_IRQ cpumask

This commit uses RCU to protect the HK_TYPE_MANAGED_IRQ cpumask during access,
moving the cpumask allocation before taking the RCU read lock since the
allocation can sleep.

I noticed a few potential regressions in this patch:

First, looking at the loop in init_vp_index(), if a selected NUMA node has
no CPUs present in the housekeeping mask (for example, if the entire node is
isolated), the intersection of available_mask and hk_mask becomes empty.
The code then clears allocated_mask and jumps back to the retry label.

Since the variables do not change state across retries, does this create
an infinite loop? Because this patch places the loop inside an RCU
read-side critical section via guard(rcu)(), would this spin result in
an RCU stall?
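A toy model of the retry (bitmask stand-ins, names invented) shows why an
empty node/housekeeping intersection needs an explicit escape rather than
an unconditional retry:

```c
#include <assert.h>

/*
 * Toy retry loop: when the node's CPUs never intersect the housekeeping
 * mask, the state reset ("clear allocated_mask") recreates the same
 * situation, so an unconditional retry spins forever. Bounding the
 * retries and falling back keeps the loop finite.
 */
static int pick_vp_cpu(unsigned int node_cpus, unsigned int hk_mask)
{
    unsigned int available = node_cpus;

    for (int pass = 0; pass < 2; pass++) {  /* bounded, unlike a bare goto */
        unsigned int cand = available & hk_mask;

        if (cand)
            return __builtin_ctz(cand);
        available = node_cpus;  /* reset, as the retry path does */
    }
    /* Fallback: any housekeeping CPU at all, or an error. */
    return hk_mask ? __builtin_ctz(hk_mask) : -1;
}
```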

Second, the patch introduces guard(rcu)() in init_vp_index() while retaining
the goto out_free cleanup path for free_cpumask_var(). According to the
kernel's cleanup guidelines, mixing goto-based error handling with
scope-based cleanups like guard() in the same function creates confusing
ownership semantics. 

Should this function be fully converted to scope-based cleanup, or 
alternatively, use explicit rcu_read_lock() and rcu_read_unlock() calls 
to avoid this mix?

Finally, before this patch, the code used short-circuit logic to skip
allocating the cpumask if the channel was not performance-critical. The
patch refactors this to unconditionally call alloc_cpumask_var() at the top
of the function before evaluating the channel type. 

Does this introduce unnecessary sleepable memory allocation and deallocation 
cycles for non-performance channels?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260421030351.281436-1-longman@redhat.com?part=15

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 17/23] sched/isolation: Extend housekeeping_dereference_check() to cover changes in nohz_full or manged_irqs cpumasks
  2026-04-21  3:03 ` [PATCH 17/23] sched/isolation: Extend housekeeping_dereference_check() to cover changes in nohz_full or manged_irqs cpumasks Waiman Long
@ 2026-04-22  3:09   ` sashiko-bot
  0 siblings, 0 replies; 56+ messages in thread
From: sashiko-bot @ 2026-04-22  3:09 UTC (permalink / raw)
  To: Waiman Long; +Cc: hannes, linux-hwmon, mkoutny, tj

Sashiko AI review found 2 potential issue(s):
- [Critical] Unprotected access to newly dynamic HK_TYPE_KERNEL_NOISE causes lockdep splats and Use-After-Free.
- [Critical] Flawed lockdep exemptions mask genuine Use-After-Free vulnerabilities.
--

commit 7b12dcd17d401c6b135ee53ba4e8384a7bd3d867
Author: Waiman Long <longman@redhat.com>

sched/isolation: Extend housekeeping_dereference_check() to cover changes in nohz_full or manged_irqs cpumasks

This commit extends housekeeping_dereference_check() to apply the same
lockdep exemptions used by HK_TYPE_DOMAIN to HK_TYPE_KERNEL_NOISE and
HK_TYPE_MANAGED_IRQ, preparing them to be runtime changeable.

> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -38,7 +38,8 @@ EXPORT_SYMBOL_GPL(housekeeping_enabled);
>  
>  static bool housekeeping_dereference_check(enum hk_type type)
>  {
> -	if (IS_ENABLED(CONFIG_LOCKDEP) && type == HK_TYPE_DOMAIN) {
> +	if (IS_ENABLED(CONFIG_LOCKDEP) &&
> +	   (BIT(type) & (HK_FLAG_DOMAIN | HK_FLAG_KERNEL_NOISE | HK_FLAG_MANAGED_IRQ))) {

Because HK_TYPE_KERNEL_NOISE is becoming runtime changeable, does this
expose existing readers to concurrent updates and use-after-free?

For example, housekeeping_show() reads this mask without holding
rcu_read_lock(). Similarly, tmigr_isolated_exclude_cpumask() reads it
holding only cpus_read_lock(), not the write lock.

If housekeeping_update() allocates a new mask, calls synchronize_rcu(),
and frees the old mask while these functions are accessing the pointer,
could this result in a use-after-free?

Additionally, does extending this lockdep exemption mask genuine
use-after-free vulnerabilities?

The exemptions below assume that holding cpuset_mutex or cpus_write_lock()
protects the pointer:

>  		/* Cpuset isn't even writable yet? */
>  		if (system_state <= SYSTEM_SCHEDULING)
>  			return true;
> 
>  		/* CPU hotplug write locked, so cpuset partition can't be overwritten */
>  		if (IS_ENABLED(CONFIG_HOTPLUG_CPU) && lockdep_is_cpus_write_held())
>  			return true;
> 
>  		/* Cpuset lock held, partitions not writable */
>  		if (IS_ENABLED(CONFIG_CPUSETS) && lockdep_is_cpuset_held())
>  			return true;

However, housekeeping_update() explicitly drops cpuset_mutex and
cpus_read_lock() before updating masks and calling synchronize_rcu().

This means housekeeping_update() can run in parallel with readers
holding cpuset_mutex or CPU hotplug callbacks holding cpus_write_lock().

Since these readers are exempted from needing rcu_read_lock(),
synchronize_rcu() will not wait for them. Could this result in a silent
use-after-free on a freed cpumask?

>  		return false;
>  	}
> 
>  	return true;
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260421030351.281436-1-longman@redhat.com?part=17

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 18/23] cpu/hotplug: Add a new cpuhp_offline_cb() API
  2026-04-21  3:03 ` [PATCH 18/23] cpu/hotplug: Add a new cpuhp_offline_cb() API Waiman Long
  2026-04-21 16:17   ` Thomas Gleixner
@ 2026-04-22  3:09   ` sashiko-bot
  1 sibling, 0 replies; 56+ messages in thread
From: sashiko-bot @ 2026-04-22  3:09 UTC (permalink / raw)
  To: Waiman Long; +Cc: linux-hwmon, hannes, tj, mkoutny

Sashiko AI review found 4 potential issue(s):
- [High] Early loop break in the rollback path leaves successfully offlined CPUs permanently offline.
- [High] Race condition in the `device_online()` retry loop bypasses recovery if hotplug is re-enabled immediately.
- [High] Hardcoded 100ms timeout for `cpu_hotplug_disabled` is brittle due to lock dropping, leading to permanently stranded CPUs.
- [Medium] Time-of-check to time-of-use (TOCTOU) race on `cpu_online_mask` validation triggers spurious kernel warnings and API failure.
--

commit ebe2a08265b3e2e79b0d74906f6ed951ac37520b
Author: Waiman Long <longman@redhat.com>

cpu/hotplug: Add a new cpuhp_offline_cb() API

This patch introduces the cpuhp_offline_cb() API to offline a set of CPUs,
execute a callback function, and then bring them back online. It allows
runtime adjustment of command line options like nohz_full and isolcpus.
A new flag, cpuhp_offline_cb_mode, is added to optimize hotplug operations
while the system is in this transient state.

> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 0d02b5d7a7baf..9b32f742cd1da 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1520,6 +1520,76 @@ int remove_cpu(unsigned int cpu)
>  }
>  EXPORT_SYMBOL_GPL(remove_cpu);
>  
> +bool cpuhp_offline_cb_mode;
> +
> +/**
> + * cpuhp_offline_cb - offline CPUs, invoke callback function & online CPUs afterward
> + * @mask: A mask of CPUs to be taken offline and then online
> + * @func: A callback function to be invoked while the given CPUs are offline
> + * @arg:  Argument to be passed back to the callback function
> + *
> + * Return: 0 if successful, an error code otherwise
> + */
> +int cpuhp_offline_cb(struct cpumask *mask, cpuhp_cb_t func, void *arg)
> +{
> +	int off_cpu, on_cpu, ret, ret2 = 0;
> +
> +	if (WARN_ON_ONCE(cpumask_empty(mask) ||
> +	   !cpumask_subset(mask, cpu_online_mask)))
> +		return -EINVAL;

Since the mask validation checks cpu_online_mask before acquiring
lock_device_hotplug(), could a concurrent user-initiated sysfs CPU offline
operation remove a CPU from cpu_online_mask after the caller generates the
mask? This would trigger a spurious WARN_ON_ONCE and fail with -EINVAL.
Should this validation be moved under the hotplug lock?

> +
> +	pr_debug("%s: begin (CPU list = %*pbl)\n", __func__, cpumask_pr_args(mask));
> +	lock_device_hotplug();
> +	cpuhp_offline_cb_mode = true;

[ ... ]

> +	/* Bring previously offline CPUs back online */
> +	for_each_cpu(on_cpu, mask) {
> +		int retries = 0;
> +
> +		if (on_cpu == off_cpu)
> +			break;
> +
> +retry:
> +		ret2 = device_online(get_cpu_device(on_cpu));
> +
> +		/*
> +		 * With the unlikely event that CPU hotplug is disabled while
> +		 * this operation is in progress, we will need to wait a bit
> +		 * for hotplug to hopefully be re-enabled again. If not, print
> +		 * a warning and return the error.
> +		 *
> +		 * cpu_hotplug_disabled is supposed to be accessed while
> +		 * holding the cpu_add_remove_lock mutex. So we need to
> +		 * use the data_race() macro to access it here.
> +		 */
> +		while ((ret2 == -EBUSY) && data_race(cpu_hotplug_disabled) &&
> +		       (++retries <= 5)) {

If device_online() returns -EBUSY, but another thread re-enables CPU hotplug
right before the while loop condition is evaluated,
data_race(cpu_hotplug_disabled) will evaluate to false. Doesn't this
short-circuit the loop entirely, leaving ret2 == -EBUSY and bypassing the
retry mechanism exactly when recovery would have succeeded?

Additionally, is the hardcoded 100ms (5 * 20ms) timeout sufficient?
Concurrent paths like PCI probing or idle injection might hold the
cpu_add_remove_lock for several seconds via cpu_hotplug_disable(). If the
timeout expires during the online phase, won't this abort the API and
permanently abandon the CPUs in an offline state?

> +			msleep(20);
> +			if (!data_race(cpu_hotplug_disabled))
> +				goto retry;
> +		}
> +		if (ret2) {
> +			pr_warn("%s: Failed to bring CPU %d back online!\n",
> +				__func__, on_cpu);
> +			break;

If device_online() fails to bring a CPU back online and breaks out of the
loop here, doesn't it prematurely exit the restoration loop? This would skip
any remaining CPUs in the mask that were successfully taken offline, leaving
them permanently offline and reducing system capacity. Should this use
continue instead of break to ensure all offlined CPUs have a chance to be
restored?

> +		}
> +	}
> +	cpuhp_offline_cb_mode = false;
> +	unlock_device_hotplug();
> +	pr_debug("%s: end\n", __func__);
> +	return ret ? ret : ret2;
> +}
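The break-versus-continue difference is mechanical and can be modeled
directly (toy code with invented names; the error value mimics -EBUSY):

```c
#include <assert.h>

/*
 * Toy restore loop: fails[i] is the result of device_online() for CPU i
 * (0 on success). `break` on the first failure strands every CPU after
 * it; `continue` records the error but still tries the rest.
 */
static int restore_all(const int *fails, int n, int *onlined)
{
    int last_err = 0;

    *onlined = 0;
    for (int i = 0; i < n; i++) {
        if (fails[i]) {
            last_err = fails[i];
            continue;  /* keep restoring, instead of break */
        }
        (*onlined)++;
    }
    return last_err;
}
```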

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260421030351.281436-1-longman@redhat.com?part=18

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 20/23] cgroup/cpuset: Enable runtime update of HK_TYPE_{KERNEL_NOISE,MANAGED_IRQ} cpumasks
  2026-04-21  3:03 ` [PATCH 20/23] cgroup/cpuset: Enable runtime update of HK_TYPE_{KERNEL_NOISE,MANAGED_IRQ} cpumasks Waiman Long
@ 2026-04-22  3:09   ` sashiko-bot
  0 siblings, 0 replies; 56+ messages in thread
From: sashiko-bot @ 2026-04-22  3:09 UTC (permalink / raw)
  To: Waiman Long; +Cc: tj, linux-hwmon, hannes, mkoutny

Sashiko AI review found 6 potential issue(s):
- [High] Stale mask used for DOMAIN isolation when `tick_nohz_full_enabled()` is false, permanently breaking CPU isolation updates.
- [High] Dynamic CPU de-isolation fails silently for `HK_TYPE_KERNEL_NOISE` and `HK_TYPE_MANAGED_IRQ` masks, causing isolated CPUs to grow monotonically.
- [High] Lack of rollback on `housekeeping_update()` or hotplug failures causes unrecoverable state desynchronization.
- [High] `cpuset_nohz_update_cbfunc()` silently swallows the return value of `housekeeping_update()`, hiding failures from the hotplug callback.
- [High] Housekeeping updates triggered by partition invalidations are permanently lost during CPU offline callbacks.
- [Medium] TOCTOU race when reading `cpu_online_mask` after dropping `cpus_read_lock()` leads to potential hotplug abortion and state inconsistency.
--

commit 35eda8af3da1589e50363cd927f6b3b902b398a2
Author: Waiman Long <longman@redhat.com>

cgroup/cpuset: Enable runtime update of HK_TYPE_{KERNEL_NOISE,MANAGED_IRQ} cpumasks

This commit enables runtime updates for KERNEL_NOISE and MANAGED_IRQ
housekeeping cpumasks. It uses CPU hotplug to facilitate transitioning CPUs
into and out of isolated states when a nohz_full boot parameter is provided.

> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 1b0c50b46a49c..a927b9cd4f718 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c

[ ... ]

> @@ -1328,29 +1336,67 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
>  	return false;
>  }
>  
> +static int cpuset_nohz_update_cbfunc(void *arg)
> +{
> +	struct cpumask *isol_cpus = (struct cpumask *)arg;
> +
> +	if (isol_cpus)
> +		housekeeping_update(isol_cpus, BIT(HK_TYPE_KERNEL_NOISE));
> +	return 0;
> +}

Is it intentional to discard the return value of housekeeping_update() here?
If housekeeping_update() fails, returning 0 will cause cpuhp_offline_cb()
to assume success, potentially hiding failures from the hotplug callback.
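A sketch of the suggested fix, with minimal stubs so it compiles standalone
(the stubs are stand-ins that let the test model a failing
housekeeping_update(); they are not the kernel definitions):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel definitions used below. */
#define BIT(n) (1UL << (n))
enum { HK_TYPE_KERNEL_NOISE = 2 };
struct cpumask { unsigned long bits; };
static int hk_update_ret;  /* test knob: next housekeeping_update() result */
static int housekeeping_update(struct cpumask *mask, unsigned long flags)
{
    (void)mask; (void)flags;
    return hk_update_ret;
}

/* Suggested shape: propagate the result instead of always returning 0,
 * so cpuhp_offline_cb() can observe the failure. */
static int cpuset_nohz_update_cbfunc(void *arg)
{
    struct cpumask *isol_cpus = arg;

    if (!isol_cpus)
        return 0;
    return housekeeping_update(isol_cpus, BIT(HK_TYPE_KERNEL_NOISE));
}
```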

[ ... ]

> +	/*
> +	 * Update isolated_nohz_cpus/isolated_mirq_cpus if necessary
> +	 */
> +	if (!boot_nohz_le_domain) {
> +		cpumask_andnot(cpus, cpu_possible_mask,
> +			       housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
> +		cpumask_or(cpus, cpus, isolated_cpus);

Could this prevent dynamic CPU de-isolation? Because housekeeping_cpumask()
returns the current runtime mask rather than the boot mask, the newly
isolated CPUs remain in the mask when cpumask_or() is called. This might
cause isolated CPUs to grow monotonically rather than allowing removal.

> +		update_nohz = !cpumask_equal(isolated_nohz_cpus, cpus);
> +		if (update_nohz)
> +			cpumask_copy(isolated_nohz_cpus, cpus);

If a failure occurs later in this function, these tracking masks are updated
but the state is not. For example, if cpuhp_offline_cb() fails, the code
jumps to out_free. If userspace then retries the operation, won't it see
cpumask_equal(isolated_cpus, isolated_hk_cpus) as true and return early,
leaving the system permanently out of sync?

[ ... ]

> @@ -1360,10 +1406,103 @@ static void cpuset_update_sd_hk_unlock(void)
>  	 */
>  	mutex_unlock(&cpuset_mutex);
>  	cpus_read_unlock();

[ ... ]

> +		/*
> +		 * Mask out offline CPUs in cpus
> +		 * If there is no online CPUs, we can call
> +		 * housekeeping_update() directly if needed.
> +		 */
> +		cpumask_and(cpus, cpus, cpu_online_mask);

Can this race with CPU hotplug operations? The cpus_read_lock was released
earlier in this function. If a CPU is offlined concurrently right after this
check, cpuhp_offline_cb() might be passed a CPU that is no longer online,
potentially causing the hotplug loop to abort.

[ ... ]

> +/*
> + * cpuset_update_sd_hk_unlock - Rebuild sched domains, update HK & unlock
> + *
> + * Update housekeeping cpumasks and rebuild sched domains if necessary and
> + * then do a cpuset_full_unlock().
> + * This should be called at the end of cpuset operation.
> + */
> +static void cpuset_update_sd_hk_unlock(void)
> +	__releases(&cpuset_mutex)
> +	__releases(&cpuset_top_mutex)
> +{
> +	/* force_sd_rebuild will be cleared in rebuild_sched_domains_locked() */
> +	if (force_sd_rebuild)
> +		rebuild_sched_domains_locked();
> +
> +	update_housekeeping = false;
> +
> +	if (cpumask_equal(isolated_cpus, isolated_hk_cpus)) {
> +		cpuset_full_unlock();
> +		return;
> +	}
> +
> +	if (!tick_nohz_full_enabled()) {
> +		/*
> +		 * housekeeping_update() is now called without holding
> +		 * cpus_read_lock and cpuset_mutex. Only cpuset_top_mutex
> +		 * is still being held for mutual exclusion.
> +		 */
> +		mutex_unlock(&cpuset_mutex);
> +		cpus_read_unlock();
> +		WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus,
> +						 BIT(HK_TYPE_DOMAIN)));

Are we passing a stale mask here? If tick_nohz_full_enabled() is false, we
skip calling cpuset_update_housekeeping_unlock(), which means
isolated_hk_cpus is never updated to match isolated_cpus. Subsequent checks
for cpumask_equal(isolated_cpus, isolated_hk_cpus) will continue to fail,
and the new mask is never applied to the DOMAIN isolation.

> +		mutex_unlock(&cpuset_top_mutex);
> +	} else {
> +		cpuset_update_housekeeping_unlock();
> +	}
> +}

[ ... ]

> @@ -3954,7 +4116,10 @@ static void cpuset_handle_hotplug(void)
>  	 */
>  	if (force_sd_rebuild)
>  		rebuild_sched_domains_cpuslocked();
> -	if (update_housekeeping)
> +	/*
> +	 * Don't need to update housekeeping cpumasks in cpuhp_offline_cb mode.
> +	 */
> +	if (update_housekeeping && !cpuhp_offline_cb_mode)
>  		queue_work(system_dfl_wq, &hk_sd_work);
>  
>  	free_tmpmasks(ptmp);

Does this cause housekeeping updates to be lost? If a CPU is offlined and
it invalidates a cpuset partition, update_housekeeping is set to true. If we
skip queuing the work here because cpuhp_offline_cb_mode is true, there
doesn't appear to be a mechanism to retroactively process the pending flag
once the bounce cycle completes.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260421030351.281436-1-longman@redhat.com?part=20

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 21/23] cgroup/cpuset: Limit the side effect of using CPU hotplug on isolated partition
  2026-04-21  3:03 ` [PATCH 21/23] cgroup/cpuset: Limit the side effect of using CPU hotplug on isolated partition Waiman Long
@ 2026-04-22  3:09   ` sashiko-bot
  0 siblings, 0 replies; 56+ messages in thread
From: sashiko-bot @ 2026-04-22  3:09 UTC (permalink / raw)
  To: Waiman Long; +Cc: linux-hwmon, mkoutny, hannes, tj

Sashiko AI review found 3 potential issue(s):
- [High] The fallback CPU mask is overwritten and ignored in `cpuset_update_tasks_cpumask`, causing task migration to fail silently.
- [High] The `cpuhp_offline_cb_mode` hack bypasses partition invalidation globally, but its fallback logic only protects `PRS_ISOLATED`, breaking standard `PRS_ROOT` partitions.
- [High] Unsynchronized access to the global `cpuhp_offline_cb_mode` flag creates a data race and TOCTOU vulnerability in partition validation.
--

commit 273c35c4b20a0b62c4d54c5410e48d5450b71df4
Author: Waiman Long <longman@redhat.com>

cgroup/cpuset: Limit the side effect of using CPU hotplug on isolated partition

CPU hotplug is used to facilitate the modification of the
HK_TYPE_KERNEL_NOISE and HK_TYPE_MANAGED_IRQ cpumasks. However, tearing
down and bringing up CPUs can impact the cpuset partition states
as well. This patch attempts to work around partition invalidation by
pretending partitions are empty during cpuhp_offline_cb_mode, and
temporarily mapping tasks in isolated partitions to their parent's
effective CPUs.

> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index a927b9cd4f718..5f6b4e67748f2 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -434,6 +434,13 @@ static inline bool partition_is_populated(struct cpuset *cs,
>  	struct cpuset *cp;
>  	struct cgroup_subsys_state *pos_css;
>  
> +	/*
> +	 * Hack: In cpuhp_offline_cb_mode, pretend all partitions are empty
> +	 * to prevent unnecessary partition invalidation.
> +	 */
> +	if (cpuhp_offline_cb_mode)
> +		return false;
> +

Is it safe to read cpuhp_offline_cb_mode here without synchronization?

The hotplug thread sets this flag while holding lock_device_hotplug(). A
concurrent userspace write to cpuset.cpus.partition triggers
validate_partition(), which holds cpuset_mutex and cpus_read_lock(), but not
lock_device_hotplug().

Could validate_partition() read cpuhp_offline_cb_mode as true before the
hotplug thread acquires cpus_write_lock(), causing the safety checks to
falsely conclude the partition is empty and bypassing cgroup validation?

[ ... ]

> @@ -3881,6 +3888,17 @@ hotplug_update_tasks(struct cpuset *cs,
>  	cs->effective_mems = *new_mems;
>  	spin_unlock_irq(&callback_lock);
>  
> +	/*
> +	 * When cpuhp_offline_cb_mode is active, valid isolated partition
> +	 * with tasks may have no online CPUs available for a short while.
> +	 * In that case, we fall back to parent's effective CPUs temporarily
> +	 * which will be reset back to their rightful value once the affected
> +	 * CPUs are online again.
> +	 */
> +	if (cpuhp_offline_cb_mode && cpumask_empty(new_cpus) &&
> +	   (cs->partition_root_state == PRS_ISOLATED))
> +		cpumask_copy(new_cpus, parent_cs(cs)->effective_cpus);
> +

Because partition_is_populated() is bypassed globally during
cpuhp_offline_cb_mode, all partitions skip normal invalidation, yet this
fallback logic only applies to PRS_ISOLATED.

If a standard PRS_ROOT partition loses its last online CPU, it gets bypassed
for invalidation but receives no fallback mask. Won't this leave the PRS_ROOT
partition with a completely empty effective_cpus mask, breaking its isolation
and causing subsequent cgroup_attach() operations to fail with -ENOSPC?

>  	if (cpus_updated)
>  		cpuset_update_tasks_cpumask(cs, new_cpus);

Does the temporary fallback mask actually get applied to the tasks?

Earlier in hotplug_update_tasks(), cs->effective_cpus is overwritten with the
empty incoming new_cpus mask. Then, inside cpuset_update_tasks_cpumask(),
new_cpus is overwritten again by intersecting it with cs->effective_cpus:

kernel/cgroup/cpuset.c:cpuset_update_tasks_cpumask() {
    ...
		} else {
			cpumask_and(new_cpus, possible_mask, cs->effective_cpus);
		}
		set_cpus_allowed_ptr(task, new_cpus);
    ...
}

Since cs->effective_cpus is already empty, the resulting new_cpus mask passed
to set_cpus_allowed_ptr() will also be empty. Doesn't set_cpus_allowed_ptr()
fail with -EINVAL when given an empty mask, leaving the task's affinity
unchanged and forcing the scheduler to migrate the tasks to arbitrary fallback
CPUs system-wide?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260421030351.281436-1-longman@redhat.com?part=21

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 22/23] cgroup/cpuset: Prevent offline_disabled CPUs from being used in isolated partition
  2026-04-21  3:03 ` [PATCH 22/23] cgroup/cpuset: Prevent offline_disabled CPUs from being used in " Waiman Long
@ 2026-04-22  3:09   ` sashiko-bot
  0 siblings, 0 replies; 56+ messages in thread
From: sashiko-bot @ 2026-04-22  3:09 UTC (permalink / raw)
  To: Waiman Long; +Cc: hannes, mkoutny, linux-hwmon, tj

Sashiko AI review found 3 potential issue(s):
- [High] Unconditional NULL pointer dereference causes boot-time kernel panic on architectures with late CPU registration or systems with non-present possible CPUs.
- [Medium] Caching the `offline_disabled` state at early boot misses dynamic runtime updates, allowing illegal CPUs to enter isolated partitions.
- [Low] Buffer truncation for the isolated partition error message when multiple CPUs have `offline_disabled` set.
--

commit b13420f1f2cd136dd5780ea125eec8631214ef49
Author: Waiman Long <longman@redhat.com>

cgroup/cpuset: Prevent offline_disabled CPUs from being used in isolated partition

This commit introduces a check to prevent CPUs with the offline_disabled
flag set from being added to isolated cpuset partitions. It initializes
a cpumask during late boot to track these CPUs and checks against it
when partition states are updated.

> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 5f6b4e67748f2..f3af8ef6c64ee 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c

[ ... ]

> @@ -164,6 +167,12 @@ static cpumask_var_t	isolated_mirq_cpus;	/* T */
>  static bool		boot_nohz_le_domain __ro_after_init;
>  static bool		boot_mirq_le_domain __ro_after_init;
>  
> +/*
> + * Cpumask of CPUs with offline_disabled set
> + * The cpumask is effectively __ro_after_init.
> + */
> +static cpumask_var_t	offline_disabled_cpus;

Is it safe to assume this mask is effectively ro_after_init? On some
architectures like ARM64, the offline_disabled state can be set dynamically
at runtime, such as when an asymmetric 32-bit EL0 CPU is brought online.
Does caching this state at boot miss these dynamic updates, potentially
allowing dynamically restricted CPUs to enter isolated partitions later on?

[ ... ]

> @@ -4188,6 +4240,39 @@ void __init cpuset_init_smp(void)
>  	BUG_ON(!cpuset_migrate_mm_wq);
>  }
>  
> +/**
> + * cpuset_init_late - initialize the list of CPUs with offline_disabled set
> + *
> + * Description: Initialize a cpumask with CPUs that have the offline_disabled
> + *		bit set. It is done in a separate initcall as cpuset_init_smp()
> + *		is called before driver_init() where the CPU devices will be
> + *		set up.
> + */
> +static int __init cpuset_init_late(void)
> +{
> +	int cpu;
> +
> +	if (!tick_nohz_full_enabled())
> +		return 0;
> +	/*
> +	 * Iterate all the possible CPUs to see which one has offline disabled.
> +	 */
> +	for_each_possible_cpu(cpu) {
> +		if (get_cpu_device(cpu)->offline_disabled)

Can get_cpu_device() return NULL here? Because this is a pure_initcall,
it runs before topology_init() on architectures like PowerPC, MIPS, and
SPARC, meaning the CPU devices might not be registered yet. 

Furthermore, on systems with physical CPU hotplug where possible_cpus exceeds
present_cpus, the non-present possible CPUs might also return NULL. Could
this lead to a NULL pointer dereference during early boot?

> +			__cpumask_set_cpu(cpu, offline_disabled_cpus);
> +	}
> +	if (!cpumask_empty(offline_disabled_cpus)) {
> +		char buf[128];
> +
> +		snprintf(buf, sizeof(buf),
> +			 "CPU %*pbl with offline disabled not allowed in isolated partition",
> +			 cpumask_pr_args(offline_disabled_cpus));

This isn't a bug, but could the 128-byte buffer silently truncate if there
are multiple non-contiguous offline-disabled CPUs? The static text takes up
59 bytes, leaving 68 bytes for the cpumask string. If the formatted string is
longer, the resulting error message surfaced to userspace via perr_strings
might be incomplete.

> +		perr_strings[PERR_OL_DISABLED] = kstrdup(buf, GFP_KERNEL);
> +	}
> +	return 0;
> +}
> +pure_initcall(cpuset_init_late);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260421030351.281436-1-longman@redhat.com?part=22

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 23/23] cgroup/cpuset: Documentation and kselftest updates
  2026-04-21  3:03 ` [PATCH 23/23] cgroup/cpuset: Documentation and kselftest updates Waiman Long
@ 2026-04-22  3:09   ` sashiko-bot
  0 siblings, 0 replies; 56+ messages in thread
From: sashiko-bot @ 2026-04-22  3:09 UTC (permalink / raw)
  To: Waiman Long; +Cc: tj, hannes, linux-hwmon, mkoutny

Sashiko AI review found 5 potential issue(s):
- [Medium] The newly added `nohz_full` check in `check_isolcpus()` is placed before the state stabilization retry loop, leading to false test failures and bypassed validation.
- [Medium] The selftest script enables kernel dynamic debugging globally but fails to restore its state when the test finishes or aborts.
- [Medium] The new `CPU0_ISOLCPUS_MATRIX` test cases omit the 12th field (`ISOLCPUS`), silently bypassing validation of the `cpuset.cpus.isolated` state.
- [Low] Dead code: the variable `CHK_NOHZ_FULL` is defined and assigned but never used.
- [Low] Grammatical errors and typos in documentation and commit message.
--

commit 84d90aacf6dde2c021ed021a8f60057d923655be
Author: Waiman Long <longman@redhat.com>

cgroup/cpuset: Documentation and kselftest updates

This commit updates cpuset tests and documentation to reflect new CPU
hotplug capabilities for nohz_full. It adds tests to prevent using CPU 0
in isolated partitions when it cannot be brought offline, and updates
documentation in cgroup-v2.rst.

> A new test will also be run in offline isn't allowed in CPU 0 to verify
> that using CPU 0 as part of an isolated partition will fail.

This isn't a bug, but should "run in offline" be "run if offline"?

> Since there is a slight performance overhead to enable runtime changes
> to nohz_full CPU list, users have to explicitly opt in by adding a
> "nohz_ful" kernel command line parameter with or without a CPU list.

This isn't a bug, but there's a typo ("nohz_ful") in the commit message.

> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 91beaa6798ce0..81181dc100ead 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
[ ... ]
> @@ -2647,11 +2648,29 @@ Cpuset Interface Files
>  	partition or scheduling domain.  The set of exclusive CPUs is
>  	determined by the value of its "cpuset.cpus.exclusive.effective".
>  
> -	When set to "isolated", the CPUs in that partition will be in
> -	an isolated state without any load balancing from the scheduler
> -	and excluded from the unbound workqueues.  Tasks placed in such
> -	a partition with multiple CPUs should be carefully distributed
> -	and bound to each of the individual CPUs for optimal performance.
> +	When set to "isolated", the CPUs in that partition will be in an
> +	isolated state without any load balancing from the scheduler and
> +	excluded from the unbound workqueues as well as other OS noises.
> +	Tasks placed in such a partition with multiple CPUs should be
> +	carefully distributed and bound to each of the individual CPUs
> +	for optimal performance.
> +
> +	As CPU hotplug, if supported, is used to improve the degree of
> +	CPU isolation close to the "nohz_full" kernel boot parameter.

This isn't a bug, but this sentence seems to be a subordinate clause lacking
a main clause.

> diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> index a56f4153c64df..eebb4122b581c 100755
> --- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> @@ -67,6 +67,12 @@ then
>  	echo Y > /sys/kernel/debug/sched/verbose
>  fi
>  
> +# Enable dynamic debug message if available
> +DYN_DEBUG=/proc/dynamic_debug/control
> +[[ -f $DYN_DEBUG ]] && {
> +	echo "file kernel/cpu.c +p" > $DYN_DEBUG
> +}

Does this leak the debug state after the test finishes?

The script enables dynamic debug messages for the CPU hotplug subsystem here.

Looking at the cleanup() function triggered via trap, it does not restore or
disable this configuration, which permanently alters the system's global debug
state.

Should the cleanup() function revert this change?

[ ... ]
> @@ -84,6 +90,15 @@ echo member > test/cpuset.cpus.partition
>  echo "" > test/cpuset.cpus
>  [[ $RESULT -eq 0 ]] && skip_test "Child cgroups are using cpuset!"
>  
> +#
> +# If nohz_full parameter is specified and nohz_full file exists, CPU hotplug
> +# will be used to modify nohz_full cpumask to include all the isolated CPUs
> +# in cpuset isolated partitions.
> +#
> +NOHZ_FULL=/sys/devices/system/cpu/nohz_full
> +BOOT_NOHZ_FULL=$(fmt -1 /proc/cmdline | grep "^nohz_full")
> +[[ "$BOOT_NOHZ_FULL" = nohz_full ]] && CHK_NOHZ_FULL=1

Is CHK_NOHZ_FULL used anywhere else in the script?

It looks like it is defined and assigned here, but the same string
comparison is redundantly evaluated later inside check_isolcpus() without
using this variable.

[ ... ]
> +#
> +# Test matrix to verify that using CPU 0 in isolated (local or remote) partition
> +# will fail when offline isn't allowed for CPU 0.
> +#
> +CPU0_ISOLCPUS_MATRIX=(
> +	#  old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS
> +	#  ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ --------
> +	"   C0-3     .      .    C4-5     P2     .      .      .     0 A1:0-3|B1:4-5 A1:P-2"
> +	"   C1-3     .      .      .      P2     .      .      .     0 A1:1-3 A1:P2"
> +	"   C1-3     .      .      .    P2:C0-3  .      .      .     0 A1:0-3 A1:P-2"
> +	"  CX0-3   C0-3     .      .       .     P2     .      .     0 A1:0-3|A2:0-3 A2:P-2"
> +	"  CX0-3 C0-3:X1-3  .      .       .     P2     .      .     0 A1:0|A2:1-3 A2:P2"
> +	"  CX0-3 C0-3:X1-3  .      .       .   P2:X0-3  .      .     0 A1:0-3|A2:0-3 A2:P-2"
> +)

Are these test cases bypassing the validation of the cpuset.cpus.isolated
state?

The test matrix headers indicate 12 fields, but the entries only provide
11 fields, omitting the final ISOLCPUS field.

Because the 12th field is missing, the non-empty test on ISOLCPUS in
check_test_results() evaluates to false, so check_isolcpus() is never called
and the isolated state goes unchecked.

[ ... ]
> @@ -789,6 +821,18 @@ check_isolcpus()
>  		EXPECTED_SDOMAIN=$EXPECTED_ISOLCPUS
>  	fi
>  
> +	#
> +	# Check if nohz_full match cpuset.cpus.isolated if nohz_boot parameter
> +	# specified with no parameter.
> +	#

This isn't a bug, but there's a typo ("nohz_boot" instead of "nohz_full")
in this comment.

> +	[[ -f $NOHZ_FULL && "$BOOT_NOHZ_FULL" = nohz_full ]] && {
> +		NOHZ_FULL_CPUS=$(cat $NOHZ_FULL)
> +		[[ "$ISOLCPUS" != "$NOHZ_FULL_CPUS" ]] && {
> +			echo "nohz_full ($NOHZ_FULL_CPUS) does not match cpuset.cpus.isolated ($ISOLCPUS)"
> +			return 1
> +		}
> +	}
> +

Will this check cause false test failures, or bypass validation, due to
asynchronous state updates?

The check reads both files once and returns an error immediately on a
mismatch, yet later in the function there is a 50ms retry loop that exists
precisely to tolerate transient states.

If the two files update at slightly different times, the check fails
instantly. Conversely, if both files are still in their old state they match
and the check passes, but the function then waits out the retry loop and
reads the new state without re-evaluating nohz_full against the final value.

Could this check be moved after the state stabilization retry loop?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260421030351.281436-1-longman@redhat.com?part=23

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 02/23] sched/isolation: Enhance housekeeping_update() to support updating more than one HK cpumask
  2026-04-21  3:03 ` [PATCH 02/23] sched/isolation: Enhance housekeeping_update() to support updating more than one HK cpumask Waiman Long
  2026-04-22  3:08   ` sashiko-bot
@ 2026-04-22  6:39   ` Chen Ridong
  1 sibling, 0 replies; 56+ messages in thread
From: Chen Ridong @ 2026-04-22  6:39 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan



On 2026/4/21 11:03, Waiman Long wrote:
> The housekeeping_update() function currently allows update to the
> HK_TYPE_DOMAIN cpumask only. As we are going to enable dynamic
> modification of the other housekeeping cpumasks, we need to extend
> it to support passing in the information about the HK cpumask(s) to
> be updated.  In cases where some HK cpumasks happen to be the same,
> it will be more efficient to update multiple HK cpumasks in one single
> call instead of calling it multiple times. Extend housekeeping_update()
> to support that as well.
> 
> Also add the restriction that passed in isolated cpumask parameter
> of housekeeping_update() must include all the CPUs isolated at boot
> time. This is currently the case for cpuset anyway.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  include/linux/sched/isolation.h |  2 +-
>  kernel/cgroup/cpuset.c          |  2 +-
>  kernel/sched/isolation.c        | 99 +++++++++++++++++++++++----------
>  3 files changed, 71 insertions(+), 32 deletions(-)
> 
> diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> index d1707f121e20..a17f16e0156e 100644
> --- a/include/linux/sched/isolation.h
> +++ b/include/linux/sched/isolation.h
> @@ -51,7 +51,7 @@ extern const struct cpumask *housekeeping_cpumask(enum hk_type type);
>  extern bool housekeeping_enabled(enum hk_type type);
>  extern void housekeeping_affine(struct task_struct *t, enum hk_type type);
>  extern bool housekeeping_test_cpu(int cpu, enum hk_type type);
> -extern int housekeeping_update(struct cpumask *isol_mask);
> +extern int housekeeping_update(struct cpumask *isol_mask, unsigned long flags);
>  extern void __init housekeeping_init(void);
>  
>  #else
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 1335e437098e..a4eccb0ec0d1 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1354,7 +1354,7 @@ static void cpuset_update_sd_hk_unlock(void)
>  		 */
>  		mutex_unlock(&cpuset_mutex);
>  		cpus_read_unlock();
> -		WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus));
> +		WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus, BIT(HK_TYPE_DOMAIN)));
>  		mutex_unlock(&cpuset_top_mutex);
>  	} else {
>  		cpuset_full_unlock();
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 9ec9ae510dc7..965d6f8fe344 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -120,48 +120,87 @@ bool housekeeping_test_cpu(int cpu, enum hk_type type)
>  }
>  EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
>  
> -int housekeeping_update(struct cpumask *isol_mask)
> -{
> -	struct cpumask *trial, *old = NULL;
> -	int err;
> +/* HK type processing table */
> +static struct {
> +	int type;
> +	int boot_type;
> +} hk_types[] = {
> +	{ HK_TYPE_DOMAIN,       HK_TYPE_DOMAIN_BOOT	  },
> +	{ HK_TYPE_MANAGED_IRQ,  HK_TYPE_MANAGED_IRQ_BOOT  },
> +	{ HK_TYPE_KERNEL_NOISE, HK_TYPE_KERNEL_NOISE_BOOT }
> +};
>  
> -	trial = kmalloc(cpumask_size(), GFP_KERNEL);
> -	if (!trial)
> -		return -ENOMEM;
> +#define HK_TYPE_CNT	ARRAY_SIZE(hk_types)
>  
> -	cpumask_andnot(trial, housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT), isol_mask);
> -	if (!cpumask_intersects(trial, cpu_online_mask)) {
> -		kfree(trial);
> -		return -EINVAL;
> +int housekeeping_update(struct cpumask *isol_mask, unsigned long flags)
> +{
> +	struct cpumask *trial[HK_TYPE_CNT];
> +	int i, err = 0;
> +
> +	for (i = 0; i < HK_TYPE_CNT; i++) {
> +		int type = hk_types[i].type;
> +		int boot = hk_types[i].boot_type;
> +
> +		trial[i] = NULL;
> +		if (flags & BIT(type)) {
> +			trial[i] = kmalloc(cpumask_size(), GFP_KERNEL);
> +			if (!trial[i]) {
> +				err = -ENOMEM;
> +				goto out;
> +			}
> +			/*
> +			 * The new HK cpumask must be a subset of its boot
> +			 * cpumask.
> +			 */
> +			cpumask_andnot(trial[i], cpu_possible_mask, isol_mask);
> +			if (!cpumask_intersects(trial[i], cpu_online_mask) ||
> +			    !cpumask_subset(trial[i], housekeeping_cpumask(boot))) {
> +				i++;
> +				err = -EINVAL;
> +				goto out;
> +			}
> +		}
>  	}
>  

The i++ here is confusing. Wouldn't it be more readable to just use
kfree(trial[i]) and then break out?

>  	if (!housekeeping.flags)
>  		static_branch_enable(&housekeeping_overridden);
>  
> -	if (housekeeping.flags & HK_FLAG_DOMAIN)
> -		old = housekeeping_cpumask_dereference(HK_TYPE_DOMAIN);
> -	else
> -		WRITE_ONCE(housekeeping.flags, housekeeping.flags | HK_FLAG_DOMAIN);
> -	rcu_assign_pointer(housekeeping.cpumasks[HK_TYPE_DOMAIN], trial);
> -
> -	synchronize_rcu();
> -
> -	pci_probe_flush_workqueue();
> -	mem_cgroup_flush_workqueue();
> -	vmstat_flush_workqueue();
> +	for (i = 0; i < HK_TYPE_CNT; i++) {
> +		int type =  hk_types[i].type;
> +		struct cpumask *old;
>  
> -	err = workqueue_unbound_housekeeping_update(housekeeping_cpumask(HK_TYPE_DOMAIN));
> -	WARN_ON_ONCE(err < 0);
> +		if (!trial[i])
> +			continue;
> +		old = NULL;
> +		if (housekeeping.flags & BIT(type))
> +			old = housekeeping_cpumask_dereference(type);
> +		rcu_assign_pointer(housekeeping.cpumasks[type], trial[i]);
> +		trial[i] = old;
> +	}
>  
> -	err = tmigr_isolated_exclude_cpumask(isol_mask);
> -	WARN_ON_ONCE(err < 0);
> +	if ((housekeeping.flags & flags) != flags)
> +		WRITE_ONCE(housekeeping.flags, housekeeping.flags | flags);
>  
> -	err = kthreads_update_housekeeping();
> -	WARN_ON_ONCE(err < 0);
> +	synchronize_rcu();
>  
> -	kfree(old);
> +	if (flags & HK_FLAG_DOMAIN) {
> +		/*
> +		 * HK_TYPE_DOMAIN specific callbacks
> +		 */
> +		pci_probe_flush_workqueue();
> +		mem_cgroup_flush_workqueue();
> +		vmstat_flush_workqueue();
> +
> +		WARN_ON_ONCE(workqueue_unbound_housekeeping_update(
> +				housekeeping_cpumask(HK_TYPE_DOMAIN)) < 0);
> +		WARN_ON_ONCE(tmigr_isolated_exclude_cpumask(isol_mask) < 0);
> +		WARN_ON_ONCE(kthreads_update_housekeeping() < 0);
> +	}
>  
> -	return 0;
> +out:
> +	while (--i >= 0)
> +		kfree(trial[i]);
> +	return err;
>  }
>  
>  void __init housekeeping_init(void)

-- 
Best regards,
Ridong


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 08/23] arm64: topology: Use RCU to protect access to HK_TYPE_TICK cpumask
  2026-04-21  3:03 ` [PATCH 08/23] arm64: topology: Use RCU to protect access to HK_TYPE_TICK cpumask Waiman Long
  2026-04-22  3:08   ` sashiko-bot
@ 2026-04-22  9:34   ` Chen Ridong
  1 sibling, 0 replies; 56+ messages in thread
From: Chen Ridong @ 2026-04-22  9:34 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan



On 2026/4/21 11:03, Waiman Long wrote:
> As the HK_TYPE_TICK cpumask is going to be changeable at run time, we
> need to use RCU to protect access to the cpumask to prevent it from
> going away in the middle of the operation.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  arch/arm64/kernel/topology.c | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index b32f13358fbb..48f150801689 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -173,6 +173,7 @@ void arch_cpu_idle_enter(void)
>  	if (!amu_fie_cpu_supported(cpu))
>  		return;
>  
> +	guard(rcu)();
>  	/* Kick in AMU update but only if one has not happened already */
>  	if (housekeeping_cpu(cpu, HK_TYPE_TICK) &&
>  	    time_is_before_jiffies(per_cpu(cpu_amu_samples.last_scale_update, cpu)))
> @@ -187,11 +188,16 @@ int arch_freq_get_on_cpu(int cpu)
>  	unsigned int start_cpu = cpu;
>  	unsigned long last_update;
>  	unsigned int freq = 0;
> +	bool hk_cpu;
>  	u64 scale;
>  
>  	if (!amu_fie_cpu_supported(cpu) || !arch_scale_freq_ref(cpu))
>  		return -EOPNOTSUPP;
>  
> +	scoped_guard(rcu) {
> +		hk_cpu = housekeeping_cpu(cpu, HK_TYPE_TICK);
> +	}
> +

Should we put this into a while loop, since cpu might be changed to ref_cpu?

>  	while (1) {
>  
>  		amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
> @@ -204,16 +210,21 @@ int arch_freq_get_on_cpu(int cpu)
>  		 * (and thus freq scale), if available, for given policy: this boils
>  		 * down to identifying an active cpu within the same freq domain, if any.
>  		 */
> -		if (!housekeeping_cpu(cpu, HK_TYPE_TICK) ||
> +		if (!hk_cpu ||
>  		    time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS))) {
>  			struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
> +			bool hk_intersects;
>  			int ref_cpu;
>  
>  			if (!policy)
>  				return -EINVAL;
>  
> -			if (!cpumask_intersects(policy->related_cpus,
> -						housekeeping_cpumask(HK_TYPE_TICK))) {
> +			scoped_guard(rcu) {
> +				hk_intersects = cpumask_intersects(policy->related_cpus,
> +							housekeeping_cpumask(HK_TYPE_TICK));
> +			}
> +
> +			if (!hk_intersects) {
>  				cpufreq_cpu_put(policy);
>  				return -EOPNOTSUPP;
>  			}

-- 
Best regards,
Ridong


^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2026-04-22  9:35 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-21  3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
2026-04-21  3:03 ` [PATCH 01/23] sched/isolation: Add HK_TYPE_KERNEL_NOISE_BOOT & HK_TYPE_MANAGED_IRQ_BOOT Waiman Long
2026-04-21  3:03 ` [PATCH 02/23] sched/isolation: Enhance housekeeping_update() to support updating more than one HK cpumask Waiman Long
2026-04-22  3:08   ` sashiko-bot
2026-04-22  6:39   ` Chen Ridong
2026-04-21  3:03 ` [PATCH 03/23] tick/nohz: Make nohz_full parameter optional Waiman Long
2026-04-21  8:32   ` Thomas Gleixner
2026-04-21 14:14     ` Waiman Long
2026-04-22  3:08   ` sashiko-bot
2026-04-21  3:03 ` [PATCH 04/23] tick/nohz: Allow runtime changes in full dynticks CPUs Waiman Long
2026-04-21  8:50   ` Thomas Gleixner
2026-04-21 14:24     ` Waiman Long
2026-04-22  3:08   ` sashiko-bot
2026-04-21  3:03 ` [PATCH 05/23] tick: Pass timer tick job to an online HK CPU in tick_cpu_dying() Waiman Long
2026-04-21  8:55   ` Thomas Gleixner
2026-04-21 14:22     ` Waiman Long
2026-04-21  3:03 ` [PATCH 06/23] rcu/nocbs: Allow runtime changes in RCU NOCBS cpumask Waiman Long
2026-04-22  3:08   ` sashiko-bot
2026-04-21  3:03 ` [PATCH 07/23] watchdog: Sync up with runtime change of isolated CPUs Waiman Long
2026-04-22  3:08   ` sashiko-bot
2026-04-21  3:03 ` [PATCH 08/23] arm64: topology: Use RCU to protect access to HK_TYPE_TICK cpumask Waiman Long
2026-04-22  3:08   ` sashiko-bot
2026-04-22  9:34   ` Chen Ridong
2026-04-21  3:03 ` [PATCH 09/23] workqueue: Use RCU to protect access of HK_TYPE_TIMER cpumask Waiman Long
2026-04-21  3:03 ` [PATCH 10/23] cpu: " Waiman Long
2026-04-21  8:57   ` Thomas Gleixner
2026-04-21 14:25     ` Waiman Long
2026-04-21  3:03 ` [PATCH 11/23] hrtimer: " Waiman Long
2026-04-21  8:59   ` Thomas Gleixner
2026-04-22  3:09   ` sashiko-bot
2026-04-21  3:03 ` [PATCH 12/23] net: Use boot time housekeeping cpumask settings for now Waiman Long
2026-04-21  3:03 ` [PATCH 13/23] sched/core: Use RCU to protect access of HK_TYPE_KERNEL_NOISE cpumask Waiman Long
2026-04-22  3:09   ` sashiko-bot
2026-04-21  3:03 ` [PATCH 14/23] hwmon/coretemp: Use RCU to protect access of HK_TYPE_MISC cpumask Waiman Long
2026-04-22  3:09   ` sashiko-bot
2026-04-21  3:03 ` [PATCH 15/23] Drivers: hv: Use RCU to protect access of HK_TYPE_MANAGED_IRQ cpumask Waiman Long
2026-04-22  3:09   ` sashiko-bot
2026-04-21  3:03 ` [PATCH 16/23] genirq/cpuhotplug: " Waiman Long
2026-04-21  9:02   ` Thomas Gleixner
2026-04-21 14:29     ` Waiman Long
2026-04-21  3:03 ` [PATCH 17/23] sched/isolation: Extend housekeeping_dereference_check() to cover changes in nohz_full or manged_irqs cpumasks Waiman Long
2026-04-22  3:09   ` sashiko-bot
2026-04-21  3:03 ` [PATCH 18/23] cpu/hotplug: Add a new cpuhp_offline_cb() API Waiman Long
2026-04-21 16:17   ` Thomas Gleixner
2026-04-21 17:29     ` Waiman Long
2026-04-21 18:43       ` Thomas Gleixner
2026-04-22  3:09   ` sashiko-bot
2026-04-21  3:03 ` [PATCH 19/23] cgroup/cpuset: Improve check for calling housekeeping_update() Waiman Long
2026-04-21  3:03 ` [PATCH 20/23] cgroup/cpuset: Enable runtime update of HK_TYPE_{KERNEL_NOISE,MANAGED_IRQ} cpumasks Waiman Long
2026-04-22  3:09   ` sashiko-bot
2026-04-21  3:03 ` [PATCH 21/23] cgroup/cpuset: Limit the side effect of using CPU hotplug on isolated partition Waiman Long
2026-04-22  3:09   ` sashiko-bot
2026-04-21  3:03 ` [PATCH 22/23] cgroup/cpuset: Prevent offline_disabled CPUs from being used in " Waiman Long
2026-04-22  3:09   ` sashiko-bot
2026-04-21  3:03 ` [PATCH 23/23] cgroup/cpuset: Documentation and kselftest updates Waiman Long
2026-04-22  3:09   ` sashiko-bot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox