linux-doc.vger.kernel.org archive mirror
From: Waiman Long <longman@redhat.com>
To: "Tejun Heo" <tj@kernel.org>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Frederic Weisbecker" <frederic@kernel.org>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	"Neeraj Upadhyay" <neeraj.upadhyay@kernel.org>,
	"Joel Fernandes" <joelagnelf@nvidia.com>,
	"Josh Triplett" <josh@joshtriplett.org>,
	"Boqun Feng" <boqun.feng@gmail.com>,
	"Uladzislau Rezki" <urezki@gmail.com>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
	"Lai Jiangshan" <jiangshanlai@gmail.com>,
	Zqiang <qiang.zhang@linux.dev>,
	"Anna-Maria Behnsen" <anna-maria@linutronix.de>,
	"Ingo Molnar" <mingo@kernel.org>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Juri Lelli" <juri.lelli@redhat.com>,
	"Vincent Guittot" <vincent.guittot@linaro.org>,
	"Dietmar Eggemann" <dietmar.eggemann@arm.com>,
	"Ben Segall" <bsegall@google.com>, "Mel Gorman" <mgorman@suse.de>,
	"Valentin Schneider" <vschneid@redhat.com>,
	"Shuah Khan" <shuah@kernel.org>
Cc: cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, rcu@vger.kernel.org,
	linux-kselftest@vger.kernel.org, Phil Auld <pauld@redhat.com>,
	Costa Shulyupin <costa.shul@redhat.com>,
	Gabriele Monaco <gmonaco@redhat.com>,
	Cestmir Kalina <ckalina@redhat.com>,
	Waiman Long <longman@redhat.com>
Subject: [RFC PATCH 08/18] cgroup/cpuset: Use CPU hotplug to enable runtime nohz_full modification
Date: Fri,  8 Aug 2025 11:10:52 -0400
Message-ID: <20250808151053.19777-9-longman@redhat.com>
In-Reply-To: <20250808151053.19777-1-longman@redhat.com>

One relatively simple way to allow runtime modification of nohz_full
and rcu_nocbs CPUs is to use CPU hotplug to bring the affected CPUs
offline first, make changes to the housekeeping cpumasks and then
bring them back online. However, doing this will be rather costly in
terms of the number of CPU cycles needed. Still, it is the easiest way
to achieve the desired result, and hopefully we can gradually reduce
the overhead over time.

Use the newly introduced cpuhp_offline_cb() API to bring the affected
CPUs offline, make the necessary housekeeping cpumask changes and then
bring those CPUs back online again.
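
As a rough illustration of the sequence described above, here is a
minimal userspace sketch. It is not kernel code: struct cpu_model,
model_exclude_isolated() and model_offline_cb() are hypothetical names,
each CPU is modelled as one bit in a 64-bit mask, and the rule that
bit 0 (standing in for the boot CPU) cannot be offlined and yields
-EPERM is an assumption of the model.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model only: one bit per CPU; none of these names are kernel API. */
struct cpu_model {
	uint64_t online;	/* CPUs currently online */
	uint64_t housekeeping;	/* CPUs allowed to run housekeeping work */
	uint64_t isolated;	/* CPUs requested for isolation */
};

/* Callback run while the affected CPUs are offline. */
int model_exclude_isolated(void *arg)
{
	struct cpu_model *m = arg;

	m->housekeeping &= ~m->isolated;
	return 0;
}

/*
 * Hypothetical stand-in for the offline -> callback -> online sequence.
 * Assumption: bit 0 is the boot CPU and cannot be taken offline.
 */
int model_offline_cb(struct cpu_model *m, uint64_t cpus,
		     int (*cb)(void *), void *arg)
{
	int ret;

	if (cpus & 1ULL)
		return -EPERM;
	m->online &= ~cpus;	/* take the affected CPUs offline */
	ret = cb(arg);		/* change the masks while they are down */
	m->online |= cpus;	/* bring the same CPUs back online */
	return ret;
}
```

With online and housekeeping both 0xff and isolated set to 0x0c
(CPUs 2-3), calling model_offline_cb(&m, m.isolated,
model_exclude_isolated, &m) leaves housekeeping at 0xf3 and online
back at 0xff.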

As the HK_TYPE_DOMAIN cpumask is going to be updated at run time, we
are going to reset any boot time isolcpus domain setting if an isolated
partition or a conflicting non-isolated partition is going to be
created.

Since rebuild_sched_domains() will be called at the end of
update_isolation_cpumasks(), earlier rebuild_sched_domains_locked()
calls will be suppressed to avoid unneeded work.
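
The suppression described above can be pictured with the small
userspace sketch below. It is only a model: struct rebuild_model and
both functions are hypothetical, the field names merely mirror
force_sd_rebuild and isolcpus_update_state.updating from the diff, and
clearing the request flag after a rebuild is a choice made for the
model.

```c
#include <stdbool.h>

/* Userspace model only; names mirror the patch but this is not kernel code. */
struct rebuild_model {
	bool force_sd_rebuild;	/* a rebuild has been requested */
	bool updating;		/* an isolation update is still pending */
	int rebuilds;		/* number of rebuilds actually performed */
};

/* Early call site: skip the rebuild while an isolation update is pending. */
void model_early_rebuild(struct rebuild_model *m)
{
	if (m->force_sd_rebuild && !m->updating) {
		m->rebuilds++;
		m->force_sd_rebuild = false;
	}
}

/* End of the isolation update: rebuild once, then clear the pending flag. */
void model_finish_update(struct rebuild_model *m)
{
	m->rebuilds++;
	m->force_sd_rebuild = false;
	m->updating = false;
}
```

In this model, a request made while updating is set results in exactly
one rebuild, performed by model_finish_update(), instead of one at each
call site.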

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 95 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 92 insertions(+), 3 deletions(-)
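
The cpumask arithmetic in do_housekeeping_exclude_cpumask() and
update_isolation_cpumasks() in the diff below may be easier to follow
with plain bitmasks. The sketch is a userspace model under stated
assumptions: one bit per CPU, hypothetical function names, and
parameters that stand in for cpu_possible_mask, cpu_active_mask,
boot_nohz_full_hk_cpus and isolated_cpus.

```c
#include <stdint.h>

/* Userspace model only: one bit per CPU; function names are hypothetical. */

/*
 * CPUs to exclude from the HK_TYPE_KERNEL_NOISE housekeeping set:
 * boot-time nohz_full CPUs (possible CPUs that are not in the boot-time
 * housekeeping set) plus the currently isolated CPUs.
 */
uint64_t model_noise_exclude(uint64_t possible, uint64_t boot_hk,
			     uint64_t isolated)
{
	return (possible & ~boot_hk) | isolated;
}

/*
 * CPUs that must go through hotplug: changed CPUs that are active and,
 * when boot-time nohz_full is in effect, were boot-time housekeeping CPUs.
 */
uint64_t model_cpus_to_offline(uint64_t changed, uint64_t active,
			       int have_boot_nohz_full, uint64_t boot_hk)
{
	uint64_t cpus = changed & active;

	if (have_boot_nohz_full)
		cpus &= boot_hk;
	return cpus;
}
```

For example, with eight possible CPUs (0xff), boot-time housekeeping
CPUs 0-3 (0x0f) and isolated CPUs 2-3 (0x0c), model_noise_exclude()
yields 0xfc. An empty result from model_cpus_to_offline() corresponds
to the case where the diff skips CPU hotplug altogether.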

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 87e9ee7922cd..60f336e50b05 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1355,11 +1355,57 @@ static void partition_xcpus_del(int old_prs, struct cpuset *parent,
 	return;
 }
 
+/*
+ * We are only updating HK_TYPE_DOMAIN and HK_TYPE_KERNEL_NOISE housekeeping
+ * cpumask for now. HK_TYPE_MANAGED_IRQ will be handled later.
+ */
+static int do_housekeeping_exclude_cpumask(void *arg __maybe_unused)
+{
+	int ret;
+	struct cpumask *icpus = isolated_cpus;
+	unsigned long flags = BIT(HK_TYPE_DOMAIN) | BIT(HK_TYPE_KERNEL_NOISE);
+
+	/*
+	 * The boot time isolcpus setting will be overwritten if set.
+	 */
+	have_boot_isolcpus = false;
+
+	if (have_boot_nohz_full) {
+		/*
+		 * Need to separate the handling of HK_TYPE_KERNEL_NOISE and
+		 * HK_TYPE_DOMAIN as different cpumasks will be used for each.
+		 */
+		ret = housekeeping_exclude_cpumask(icpus, BIT(HK_TYPE_DOMAIN));
+		WARN_ON_ONCE((ret < 0) && (ret != -EOPNOTSUPP));
+
+		if (cpumask_empty(isolcpus_update_state.cpus))
+			return ret;
+		flags = BIT(HK_TYPE_KERNEL_NOISE);
+		icpus = kmalloc(cpumask_size(), GFP_KERNEL);
+		if (WARN_ON_ONCE(!icpus))
+			return -ENOMEM;
+
+		/*
+		 * Add boot time nohz_full CPUs into the isolated CPUs list
+		 * for exclusion from HK_TYPE_KERNEL_NOISE CPUs.
+		 */
+		cpumask_andnot(icpus, cpu_possible_mask, boot_nohz_full_hk_cpus);
+		cpumask_or(icpus, icpus, isolated_cpus);
+	}
+	ret = housekeeping_exclude_cpumask(icpus, flags);
+	WARN_ON_ONCE((ret < 0) && (ret != -EOPNOTSUPP));
+
+	if (icpus != isolated_cpus)
+		kfree(icpus);
+	return ret;
+}
+
 /**
  * update_isolation_cpumasks - Update external isolation CPU masks
  *
  * The following external CPU masks will be updated if necessary:
  * - workqueue unbound cpumask
+ * - housekeeping cpumasks
  */
 static void update_isolation_cpumasks(void)
 {
@@ -1371,7 +1417,41 @@ static void update_isolation_cpumasks(void)
 	ret = workqueue_unbound_exclude_cpumask(isolated_cpus);
 	WARN_ON_ONCE(ret < 0);
 
+	/*
+	 * Mask out offline and boot-time nohz_full non-housekeeping
+	 * CPUs from isolcpus_update_state.cpus to compute the set
+	 * of CPUs that need to be brought offline before calling
+	 * do_housekeeping_exclude_cpumask().
+	 */
+	cpumask_and(isolcpus_update_state.cpus,
+		    isolcpus_update_state.cpus, cpu_active_mask);
+	if (have_boot_nohz_full)
+		cpumask_and(isolcpus_update_state.cpus,
+			    isolcpus_update_state.cpus, boot_nohz_full_hk_cpus);
+
+	/*
+	 * Without any change in the set of nohz_full CPUs, we don't really
+	 * need to use CPU hotplug for making change in HK cpumasks.
+	 */
+	if (cpumask_empty(isolcpus_update_state.cpus))
+		ret = do_housekeeping_exclude_cpumask(NULL);
+	else
+		ret = cpuhp_offline_cb(isolcpus_update_state.cpus,
+				       do_housekeeping_exclude_cpumask, NULL);
+	/*
+	 * An errno value of -EPERM may be returned from cpuhp_offline_cb() if
+	 * any one of the CPUs in isolcpus_update_state.cpus can't be brought
+	 * offline. This can happen for the boot CPU (normally CPU 0) which
+	 * cannot be shut down. This CPU should not be used for creating
+	 * an isolated partition.
+	 */
+	if (ret == -EPERM)
+		pr_warn_once("cpuset: The boot CPU shouldn't be used for isolated partition\n");
+	else
+		WARN_ON_ONCE(ret < 0);
+
 	cpumask_clear(isolcpus_update_state.cpus);
+	rebuild_sched_domains();
 	isolcpus_update_state.updating = false;
 }
 
@@ -2961,7 +3041,16 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	update_partition_sd_lb(cs, old_prs);
 
 	notify_partition_change(cs, old_prs);
-	if (force_sd_rebuild)
+
+	/*
+	 * If boot time domain isolcpus exists and it conflicts with the CPUs
+	 * in the new partition, we will have to reset HK_TYPE_DOMAIN cpumask.
+	 */
+	if (have_boot_isolcpus && (new_prs > PRS_MEMBER) &&
+	    !cpumask_subset(cs->effective_xcpus, housekeeping_cpumask(HK_TYPE_DOMAIN)))
+		isolcpus_update_state.updating = true;
+
+	if (force_sd_rebuild && !isolcpus_update_state.updating)
 		rebuild_sched_domains_locked();
 	free_cpumasks(NULL, &tmpmask);
 	return 0;
@@ -3232,7 +3321,7 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	}
 
 	free_cpuset(trialcs);
-	if (force_sd_rebuild)
+	if (force_sd_rebuild && !isolcpus_update_state.updating)
 		rebuild_sched_domains_locked();
 out_unlock:
 	mutex_unlock(&cpuset_mutex);
@@ -3999,7 +4088,7 @@ static void cpuset_handle_hotplug(void)
 	}
 
 	/* rebuild sched domains if necessary */
-	if (force_sd_rebuild)
+	if (force_sd_rebuild && !isolcpus_update_state.updating)
 		rebuild_sched_domains_cpuslocked();
 
 	free_cpumasks(NULL, ptmp);
-- 
2.50.0


