From: Waiman Long <longman@redhat.com>
To: "Tejun Heo" <tj@kernel.org>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Michal Koutný" <mkoutny@suse.com>,
"Jonathan Corbet" <corbet@lwn.net>,
"Shuah Khan" <skhan@linuxfoundation.org>,
"Catalin Marinas" <catalin.marinas@arm.com>,
"Will Deacon" <will@kernel.org>,
"K. Y. Srinivasan" <kys@microsoft.com>,
"Haiyang Zhang" <haiyangz@microsoft.com>,
"Wei Liu" <wei.liu@kernel.org>,
"Dexuan Cui" <decui@microsoft.com>,
"Long Li" <longli@microsoft.com>,
"Guenter Roeck" <linux@roeck-us.net>,
"Frederic Weisbecker" <frederic@kernel.org>,
"Paul E. McKenney" <paulmck@kernel.org>,
"Neeraj Upadhyay" <neeraj.upadhyay@kernel.org>,
"Joel Fernandes" <joelagnelf@nvidia.com>,
"Josh Triplett" <josh@joshtriplett.org>,
"Boqun Feng" <boqun@kernel.org>,
"Uladzislau Rezki" <urezki@gmail.com>,
"Steven Rostedt" <rostedt@goodmis.org>,
"Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
"Lai Jiangshan" <jiangshanlai@gmail.com>,
Zqiang <qiang.zhang@linux.dev>,
"Anna-Maria Behnsen" <anna-maria@linutronix.de>,
"Ingo Molnar" <mingo@kernel.org>,
"Thomas Gleixner" <tglx@kernel.org>,
"Chen Ridong" <chenridong@huaweicloud.com>,
"Peter Zijlstra" <peterz@infradead.org>,
"Juri Lelli" <juri.lelli@redhat.com>,
"Vincent Guittot" <vincent.guittot@linaro.org>,
"Dietmar Eggemann" <dietmar.eggemann@arm.com>,
"Ben Segall" <bsegall@google.com>, "Mel Gorman" <mgorman@suse.de>,
"Valentin Schneider" <vschneid@redhat.com>,
"K Prateek Nayak" <kprateek.nayak@amd.com>,
"David S. Miller" <davem@davemloft.net>,
"Eric Dumazet" <edumazet@google.com>,
"Jakub Kicinski" <kuba@kernel.org>,
"Paolo Abeni" <pabeni@redhat.com>,
"Simon Horman" <horms@kernel.org>
Cc: cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-hyperv@vger.kernel.org, linux-hwmon@vger.kernel.org,
rcu@vger.kernel.org, netdev@vger.kernel.org,
linux-kselftest@vger.kernel.org,
Costa Shulyupin <cshulyup@redhat.com>,
Qiliang Yuan <realwujing@gmail.com>,
Waiman Long <longman@redhat.com>
Subject: [PATCH 20/23] cgroup/cpuset: Enable runtime update of HK_TYPE_{KERNEL_NOISE,MANAGED_IRQ} cpumasks
Date: Mon, 20 Apr 2026 23:03:48 -0400
Message-ID: <20260421030351.281436-21-longman@redhat.com>
In-Reply-To: <20260421030351.281436-1-longman@redhat.com>
One simple way to enable runtime updates of the HK_TYPE_KERNEL_NOISE
(nohz_full) and HK_TYPE_MANAGED_IRQ cpumasks is to make use of CPU
hotplug to facilitate the transition of the CPUs that are changing
state, as long as CONFIG_HOTPLUG_CPU is enabled and a nohz_full boot
parameter is provided. Otherwise, only the HK_TYPE_DOMAIN cpumask will
be updated at run time.
Changes in the HK_TYPE_DOMAIN cpumask can be made without using CPU
hotplug. For changes in the HK_TYPE_MANAGED_IRQ cpumask, we have to
update the cpumask first and then tear down and bring back up the newly
isolated CPUs so that the managed irqs on those CPUs are migrated to
other housekeeping CPUs.
For changes in the HK_TYPE_KERNEL_NOISE cpumask, we have to tear down
all the newly isolated and de-isolated CPUs, change the cpumask, and
then bring all the offline CPUs back online.
As the boot-time versions of the various housekeeping cpumasks may
differ, different sets of isolated cpumasks may be needed when calling
housekeeping_update(). These cpumasks are pre-allocated when necessary.
Note that using CPU hotplug to change the HK_TYPE_KERNEL_NOISE and
HK_TYPE_MANAGED_IRQ housekeeping cpumasks has the drawback that, during
the tear-down of a CPU from the CPUHP_TEARDOWN_CPU state down to
CPUHP_AP_OFFLINE, the stop_machine code is invoked to stop all the
other CPUs, including all the pre-existing isolated CPUs. That causes
latency spikes on those isolated CPUs. Such a latency spike should only
happen when a cpuset isolated partition setting is changed in a way
that modifies those housekeeping cpumasks.
One workaround in use today is to pre-allocate a set of nohz_full and
managed_irq CPUs at boot time. These semi-isolated CPUs are then used
to create cpuset isolated partitions when full isolation is needed.
This practice will likely continue even after the nohz_full and
managed_irq CPUs become runtime changeable, for workloads that cannot
tolerate such latency spikes. This is a problem to be addressed in a
future patch series.
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 199 +++++++++++++++++++++++++++++++++++++----
1 file changed, 182 insertions(+), 17 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 1b0c50b46a49..a927b9cd4f71 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -152,9 +152,17 @@ static cpumask_var_t isolated_cpus; /* CSCB */
static bool update_housekeeping; /* RWCS */
/*
- * Copy of isolated_cpus to be passed to housekeeping_update()
+ * Cpumasks to be passed to housekeeping_update()
+ * isolated_hk_cpus - copy of isolated_cpus for HK_TYPE_DOMAIN
+ * isolated_nohz_cpus - for HK_TYPE_KERNEL_NOISE
+ * isolated_mirq_cpus - for HK_TYPE_MANAGED_IRQ
*/
static cpumask_var_t isolated_hk_cpus; /* T */
+static cpumask_var_t isolated_nohz_cpus; /* T */
+static cpumask_var_t isolated_mirq_cpus; /* T */
+
+static bool boot_nohz_le_domain __ro_after_init;
+static bool boot_mirq_le_domain __ro_after_init;
/*
* A flag to force sched domain rebuild at the end of an operation.
@@ -1328,29 +1336,67 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
return false;
}
+static int cpuset_nohz_update_cbfunc(void *arg)
+{
+ struct cpumask *isol_cpus = (struct cpumask *)arg;
+
+ if (isol_cpus)
+ housekeeping_update(isol_cpus, BIT(HK_TYPE_KERNEL_NOISE));
+ return 0;
+}
+
/*
- * cpuset_update_sd_hk_unlock - Rebuild sched domains, update HK & unlock
- *
- * Update housekeeping cpumasks and rebuild sched domains if necessary and
- * then do a cpuset_full_unlock().
- * This should be called at the end of cpuset operation.
*/
-static void cpuset_update_sd_hk_unlock(void)
- __releases(&cpuset_mutex)
- __releases(&cpuset_top_mutex)
+static void cpuset_update_housekeeping_unlock(void)
{
- update_housekeeping = false;
+ bool update_nohz, update_mirq;
+ cpumask_var_t cpus;
+ int ret;
- /* force_sd_rebuild will be cleared in rebuild_sched_domains_locked() */
- if (force_sd_rebuild)
- rebuild_sched_domains_locked();
+ if (!tick_nohz_full_enabled())
+ return;
- if (cpumask_equal(isolated_hk_cpus, isolated_cpus)) {
- /* No housekeeping cpumask update needed */
+ update_nohz = boot_nohz_le_domain;
+ update_mirq = boot_mirq_le_domain;
+
+ if (WARN_ON_ONCE(!alloc_cpumask_var(&cpus, GFP_KERNEL))) {
cpuset_full_unlock();
return;
}
+ /*
+ * Update isolated_nohz_cpus/isolated_mirq_cpus if necessary
+ */
+ if (!boot_nohz_le_domain) {
+ cpumask_andnot(cpus, cpu_possible_mask,
+ housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
+ cpumask_or(cpus, cpus, isolated_cpus);
+ update_nohz = !cpumask_equal(isolated_nohz_cpus, cpus);
+ if (update_nohz)
+ cpumask_copy(isolated_nohz_cpus, cpus);
+ }
+ if (!boot_mirq_le_domain) {
+ cpumask_andnot(cpus, cpu_possible_mask,
+ housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));
+ cpumask_or(cpus, cpus, isolated_cpus);
+ update_mirq = !cpumask_equal(isolated_mirq_cpus, cpus);
+ if (update_mirq)
+ cpumask_copy(isolated_mirq_cpus, cpus);
+ }
+
+ /*
+ * Compute list of CPUs to be brought offline into "cpus"
+ * isolated_hk_cpus - old cpumask
+ * isolated_cpus - new cpumask
+ *
+ * With update_nohz, we need to offline both the newly isolated
+ * and de-isolated CPUs. With only update_mirq, we only need to
+ * offline the newly isolated CPUs.
+ */
+ if (update_nohz)
+ cpumask_xor(cpus, isolated_hk_cpus, isolated_cpus);
+ else if (update_mirq)
+ cpumask_andnot(cpus, isolated_cpus, isolated_hk_cpus);
cpumask_copy(isolated_hk_cpus, isolated_cpus);
/*
@@ -1360,10 +1406,103 @@ static void cpuset_update_sd_hk_unlock(void)
*/
mutex_unlock(&cpuset_mutex);
cpus_read_unlock();
- WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus, BIT(HK_TYPE_DOMAIN)));
+
+ if (!update_mirq) {
+ ret = housekeeping_update(isolated_hk_cpus, BIT(HK_TYPE_DOMAIN));
+ } else if (boot_mirq_le_domain) {
+ ret = housekeeping_update(isolated_hk_cpus,
+ BIT(HK_TYPE_DOMAIN)|BIT(HK_TYPE_MANAGED_IRQ));
+ } else {
+ ret = housekeeping_update(isolated_hk_cpus, BIT(HK_TYPE_DOMAIN));
+ if (!ret)
+ ret = housekeeping_update(isolated_mirq_cpus,
+ BIT(HK_TYPE_MANAGED_IRQ));
+ }
+
+ if (WARN_ON_ONCE(ret))
+ goto out_free;
+
+ /*
+ * Calling cpuhp_offline_cb() is only needed if the
+ * HK_TYPE_KERNEL_NOISE and/or HK_TYPE_MANAGED_IRQ cpumasks
+ * need to be updated.
+ *
+ * TODO: When tearing down a CPU from CPUHP_TEARDOWN_CPU state
+ * downward to CPUHP_AP_OFFLINE, the stop_machine code will be
+ * invoked to stop all the other CPUs in the system. This will
+ * cause latency spikes on existing isolated CPUs. We will need
+ * some new mechanism to enable us to not stop the existing
+ * isolated CPUs whenever possible. A possible workaround is
+ * to pre-allocate a set of nohz_full and managed_irq CPUs at
+ * boot time and use them to create isolated cpuset partitions
+ * so that CPU hotplug won't need to be used.
+ */
+ if (update_mirq || update_nohz) {
+ struct cpumask *nohz_cpus;
+
+ /*
+ * Calling housekeeping_update() is only needed if
+ * update_nohz is set.
+ */
+ nohz_cpus = !update_nohz ? NULL : boot_nohz_le_domain
+ ? isolated_hk_cpus
+ : isolated_nohz_cpus;
+ /*
+ * Mask out offline CPUs in "cpus".
+ * If there are no online CPUs left, we can call
+ * housekeeping_update() directly if needed.
+ */
+ cpumask_and(cpus, cpus, cpu_online_mask);
+ if (!cpumask_empty(cpus))
+ ret = cpuhp_offline_cb(cpus, cpuset_nohz_update_cbfunc,
+ nohz_cpus);
+ else if (nohz_cpus)
+ ret = housekeeping_update(nohz_cpus, BIT(HK_TYPE_KERNEL_NOISE));
+ }
+ WARN_ON_ONCE(ret);
+out_free:
+ free_cpumask_var(cpus);
mutex_unlock(&cpuset_top_mutex);
}
+/*
+ * cpuset_update_sd_hk_unlock - Rebuild sched domains, update HK & unlock
+ *
+ * Update housekeeping cpumasks and rebuild sched domains if necessary and
+ * then do a cpuset_full_unlock().
+ * This should be called at the end of cpuset operation.
+ */
+static void cpuset_update_sd_hk_unlock(void)
+ __releases(&cpuset_mutex)
+ __releases(&cpuset_top_mutex)
+{
+ /* force_sd_rebuild will be cleared in rebuild_sched_domains_locked() */
+ if (force_sd_rebuild)
+ rebuild_sched_domains_locked();
+
+ update_housekeeping = false;
+
+ if (cpumask_equal(isolated_cpus, isolated_hk_cpus)) {
+ cpuset_full_unlock();
+ return;
+ }
+
+ if (!tick_nohz_full_enabled()) {
+ /*
+ * housekeeping_update() is now called without holding
+ * cpus_read_lock and cpuset_mutex. Only cpuset_top_mutex
+ * is still being held for mutual exclusion.
+ */
+ mutex_unlock(&cpuset_mutex);
+ cpus_read_unlock();
+ WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus,
+ BIT(HK_TYPE_DOMAIN)));
+ mutex_unlock(&cpuset_top_mutex);
+ } else {
+ cpuset_update_housekeeping_unlock();
+ }
+}
+
/*
* Work function to invoke cpuset_update_sd_hk_unlock()
*/
@@ -3700,6 +3839,29 @@ int __init cpuset_init(void)
housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT));
cpumask_copy(isolated_hk_cpus, isolated_cpus);
}
+
+ /*
+ * If the nohz_full and/or managed_irq cpu list, when present, is a
+ * subset of the domain cpu list (i.e. the HK_TYPE_DOMAIN_BOOT cpumask
+ * is a subset of the HK_TYPE_KERNEL_NOISE_BOOT/HK_TYPE_MANAGED_IRQ_BOOT
+ * cpumask), the corresponding non-boot housekeeping cpumask will follow
+ * changes in the HK_TYPE_DOMAIN cpumask. So we don't need separate
+ * cpumasks to isolate those CPUs.
+ */
+ boot_nohz_le_domain = cpumask_subset(housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT),
+ housekeeping_cpumask(HK_TYPE_KERNEL_NOISE_BOOT));
+ boot_mirq_le_domain = cpumask_subset(housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT),
+ housekeeping_cpumask(HK_TYPE_MANAGED_IRQ_BOOT));
+ if (!boot_nohz_le_domain) {
+ BUG_ON(!alloc_cpumask_var(&isolated_nohz_cpus, GFP_KERNEL));
+ cpumask_andnot(isolated_nohz_cpus, cpu_possible_mask,
+ housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
+ }
+ if (!boot_mirq_le_domain) {
+ BUG_ON(!alloc_cpumask_var(&isolated_mirq_cpus, GFP_KERNEL));
+ cpumask_andnot(isolated_mirq_cpus, cpu_possible_mask,
+ housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));
+ }
return 0;
}
@@ -3954,7 +4116,10 @@ static void cpuset_handle_hotplug(void)
*/
if (force_sd_rebuild)
rebuild_sched_domains_cpuslocked();
- if (update_housekeeping)
+ /*
+ * No housekeeping cpumask update is needed in cpuhp_offline_cb mode.
+ */
+ if (update_housekeeping && !cpuhp_offline_cb_mode)
queue_work(system_dfl_wq, &hk_sd_work);
free_tmpmasks(ptmp);
--
2.53.0
Thread overview: 38+ messages
2026-04-21 3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
2026-04-21 3:03 ` [PATCH 01/23] sched/isolation: Add HK_TYPE_KERNEL_NOISE_BOOT & HK_TYPE_MANAGED_IRQ_BOOT Waiman Long
2026-04-21 3:03 ` [PATCH 02/23] sched/isolation: Enhance housekeeping_update() to support updating more than one HK cpumask Waiman Long
2026-04-21 3:03 ` [PATCH 03/23] tick/nohz: Make nohz_full parameter optional Waiman Long
2026-04-21 8:32 ` Thomas Gleixner
2026-04-21 14:14 ` Waiman Long
2026-04-21 3:03 ` [PATCH 04/23] tick/nohz: Allow runtime changes in full dynticks CPUs Waiman Long
2026-04-21 8:50 ` Thomas Gleixner
2026-04-21 14:24 ` Waiman Long
2026-04-21 3:03 ` [PATCH 05/23] tick: Pass timer tick job to an online HK CPU in tick_cpu_dying() Waiman Long
2026-04-21 8:55 ` Thomas Gleixner
2026-04-21 14:22 ` Waiman Long
2026-04-21 3:03 ` [PATCH 06/23] rcu/nocbs: Allow runtime changes in RCU NOCBS cpumask Waiman Long
2026-04-21 3:03 ` [PATCH 07/23] watchdog: Sync up with runtime change of isolated CPUs Waiman Long
2026-04-21 3:03 ` [PATCH 08/23] arm64: topology: Use RCU to protect access to HK_TYPE_TICK cpumask Waiman Long
2026-04-21 3:03 ` [PATCH 09/23] workqueue: Use RCU to protect access of HK_TYPE_TIMER cpumask Waiman Long
2026-04-21 3:03 ` [PATCH 10/23] cpu: " Waiman Long
2026-04-21 8:57 ` Thomas Gleixner
2026-04-21 14:25 ` Waiman Long
2026-04-21 3:03 ` [PATCH 11/23] hrtimer: " Waiman Long
2026-04-21 8:59 ` Thomas Gleixner
2026-04-21 3:03 ` [PATCH 12/23] net: Use boot time housekeeping cpumask settings for now Waiman Long
2026-04-21 3:03 ` [PATCH 13/23] sched/core: Use RCU to protect access of HK_TYPE_KERNEL_NOISE cpumask Waiman Long
2026-04-21 3:03 ` [PATCH 14/23] hwmon/coretemp: Use RCU to protect access of HK_TYPE_MISC cpumask Waiman Long
2026-04-21 3:03 ` [PATCH 15/23] Drivers: hv: Use RCU to protect access of HK_TYPE_MANAGED_IRQ cpumask Waiman Long
2026-04-21 3:03 ` [PATCH 16/23] genirq/cpuhotplug: " Waiman Long
2026-04-21 9:02 ` Thomas Gleixner
2026-04-21 14:29 ` Waiman Long
2026-04-21 3:03 ` [PATCH 17/23] sched/isolation: Extend housekeeping_dereference_check() to cover changes in nohz_full or manged_irqs cpumasks Waiman Long
2026-04-21 3:03 ` [PATCH 18/23] cpu/hotplug: Add a new cpuhp_offline_cb() API Waiman Long
2026-04-21 16:17 ` Thomas Gleixner
2026-04-21 17:29 ` Waiman Long
2026-04-21 18:43 ` Thomas Gleixner
2026-04-21 3:03 ` [PATCH 19/23] cgroup/cpuset: Improve check for calling housekeeping_update() Waiman Long
2026-04-21 3:03 ` Waiman Long [this message]
2026-04-21 3:03 ` [PATCH 21/23] cgroup/cpuset: Limit the side effect of using CPU hotplug on isolated partition Waiman Long
2026-04-21 3:03 ` [PATCH 22/23] cgroup/cpuset: Prevent offline_disabled CPUs from being used in " Waiman Long
2026-04-21 3:03 ` [PATCH 23/23] cgroup/cpuset: Documentation and kselftest updates Waiman Long