From: Waiman Long <longman@redhat.com>
To: "Tejun Heo" <tj@kernel.org>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Frederic Weisbecker" <frederic@kernel.org>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	"Neeraj Upadhyay" <neeraj.upadhyay@kernel.org>,
	"Joel Fernandes" <joelagnelf@nvidia.com>,
	"Josh Triplett" <josh@joshtriplett.org>,
	"Boqun Feng" <boqun.feng@gmail.com>,
	"Uladzislau Rezki" <urezki@gmail.com>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
	"Lai Jiangshan" <jiangshanlai@gmail.com>,
	Zqiang <qiang.zhang@linux.dev>,
	"Anna-Maria Behnsen" <anna-maria@linutronix.de>,
	"Ingo Molnar" <mingo@kernel.org>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Juri Lelli" <juri.lelli@redhat.com>,
	"Vincent Guittot" <vincent.guittot@linaro.org>,
	"Dietmar Eggemann" <dietmar.eggemann@arm.com>,
	"Ben Segall" <bsegall@google.com>, "Mel Gorman" <mgorman@suse.de>,
	"Valentin Schneider" <vschneid@redhat.com>,
	"Shuah Khan" <shuah@kernel.org>
Cc: cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, rcu@vger.kernel.org,
	linux-kselftest@vger.kernel.org, Phil Auld <pauld@redhat.com>,
	Costa Shulyupin <costa.shul@redhat.com>,
	Gabriele Monaco <gmonaco@redhat.com>,
	Cestmir Kalina <ckalina@redhat.com>,
	Waiman Long <longman@redhat.com>
Subject: [RFC PATCH 00/18] cgroup/cpuset: Enable runtime modification of
Date: Fri,  8 Aug 2025 11:10:44 -0400	[thread overview]
Message-ID: <20250808151053.19777-1-longman@redhat.com> (raw)

The "nohz_full" and "rcu_nocbs" boot command parameters can be used to
remove a lot of kernel overhead on a specific set of isolated CPUs which
can be used to run some latency/bandwidth sensitive workloads with as
little kernel disturbance/noise as possible. The problem with this mode
of operation is the fact that it is a static configuration which cannot
be changed after boot to adjust for changes in application loading.

There is always a desire to enable runtime modification of the number
of isolated CPUs that can be dedicated to this type of demanding
workload. This patchset is an attempt to do just that, with an amount of
CPU isolation close to what can be done with the nohz_full and rcu_nocbs
boot kernel parameters.

This patch series provides the ability to change the set of housekeeping
CPUs at run time via the cpuset isolated partition functionality.
Currently, the cpuset isolated partition is able to disable scheduler
load balancing and to adjust the CPU affinity of unbound workqueues to
avoid the isolated CPUs. This patch series extends that to the other
kernel noises associated with the nohz_full boot command line parameter,
which has the following sub-categories:
  - tick
  - timer
  - RCU
  - MISC
  - WQ
  - kthread
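
For readers less familiar with the cpuset side, here is a minimal
user-space sketch of how an isolated partition is set up through the
cgroup v2 interface files (cgroup.subtree_control, cpuset.cpus and
cpuset.cpus.partition). The helper name make_isolated_partition(), the
child cgroup name and the CPU list are hypothetical; on a real system
the root would normally be /sys/fs/cgroup, the writes need root
privileges, and the kernel may reject the request if the partition
rules are not met.

```python
import os


def make_isolated_partition(cgroup_root, name, cpus):
    """Create child cgroup `name` under `cgroup_root` and request an
    isolated partition owning the CPU list `cpus` (e.g. "2-5").

    Hypothetical helper for illustration only; it performs no validation
    and relies on the kernel to accept or reject the configuration.
    """
    # Make the cpuset controller available to child cgroups.
    with open(os.path.join(cgroup_root, "cgroup.subtree_control"), "w") as f:
        f.write("+cpuset")

    child = os.path.join(cgroup_root, name)
    os.makedirs(child, exist_ok=True)

    # Assign the CPUs, then switch the cpuset into an isolated partition.
    with open(os.path.join(child, "cpuset.cpus"), "w") as f:
        f.write(cpus)
    with open(os.path.join(child, "cpuset.cpus.partition"), "w") as f:
        f.write("isolated")
    return child
```

Reading cpuset.cpus.partition back afterward is the usual way to check
whether the partition actually became valid.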

The rcu_nocbs parameter is actually a subset of nohz_full, focusing just
on the RCU part of the kernel noises. The WQ part has already been
handled by the current cpuset code.

This series focuses on the tick and RCU part of the kernel noises by
actively changing their internal data structures to track changes in
the list of isolated CPUs used by cpuset isolated partitions.

The dynamic update of the lists of housekeeping CPUs at run time will
also have an impact on the other parts of the kernel noises that
reference the lists of housekeeping CPUs at run time.
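
As a rough user-space model of that relationship (not the kernel code),
the housekeeping set can be treated as the complement of the isolated
set, and any consumer that looks it up at run time will see the new
value after an update. The function names and the "lowest online CPU"
choice below are assumptions made for this sketch only.

```python
def housekeeping_cpus(possible, isolated):
    """Return the CPUs left for housekeeping once `isolated` is removed."""
    hk = set(possible) - set(isolated)
    if not hk:
        # Assumption of this sketch: some CPU must remain for housekeeping.
        raise ValueError("at least one housekeeping CPU must remain")
    return hk


def pick_housekeeping_cpu(housekeeping, online):
    """Pick a housekeeping CPU that is online (lowest-numbered, by choice)."""
    candidates = set(housekeeping) & set(online)
    if not candidates:
        raise ValueError("no online housekeeping CPU")
    return min(candidates)
```

For example, with 8 possible CPUs and CPUs 2-3 isolated, the
housekeeping set is {0, 1, 4, 5, 6, 7}; isolating CPU 4 as well removes
it from the set seen by every later lookup.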

The pending patch series on timer migration[1], when properly
integrated, will support the timer part too.

The CPU hotplug functionality of the Linux kernel is used to facilitate
the runtime change of the nohz_full isolated CPUs with minimal code
changes. The CPUs that need to be switched from non-isolated to
isolated or vice versa will be brought offline first, the necessary
changes will be made, and the CPUs will then be brought back online.
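
The ordering described above can be modeled in user space as follows.
This is only a sketch of the offline/update/online sequence using plain
Python sets; offline_update_online() and its arguments are hypothetical
names and do not reflect the signature of any kernel function.

```python
def offline_update_online(online, affected, update):
    """Take `affected` CPUs offline, run `update`, then bring them back.

    Returns the final online set and a trace of the steps taken.
    """
    trace = []
    # Only CPUs that are currently online need to be cycled.
    target = set(affected) & set(online)

    online = set(online) - target
    trace.append(("offline", sorted(target)))
    try:
        # The isolation state is changed while the CPUs are offline.
        update()
        trace.append(("update", None))
    finally:
        # The CPUs come back online even if the update step fails.
        online |= target
        trace.append(("online", sorted(target)))
    return online, trace
```

The point of the ordering is that the update callback never runs while
any of the affected CPUs is online.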

The use of CPU hotplug, however, does have a slight drawback of
freezing all the other CPUs during part of the offlining process, which
uses the stop machine feature of the kernel. That will cause noticeable
latency spikes in other running applications, which may be significant
to sensitive applications running on isolated CPUs in other isolated
partitions at the time. Hopefully we can find a way to solve this
problem in the future.

One possible workaround for this is to reserve a set of nohz_full
isolated CPUs at boot time using the nohz_full boot command parameter.
Moving those reserved nohz_full CPUs into and out of isolated
partitions will not invoke CPU hotplug and hence will not cause
unexpected latency spikes. These reserved CPUs will only be needed
if there are other existing isolated partitions running critical
applications at the time when an isolated partition needs to be created.
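
To illustrate the workaround, the sketch below computes which CPUs in a
requested change would still need a hotplug cycle, assuming that CPUs
already listed in the boot-time nohz_full parameter are exempt as
described above. Both function names are hypothetical, and the parser
only handles the simple "N" and "N-M" forms of a CPU list.

```python
def parse_cpu_list(text):
    """Parse a CPU list such as "2-5,8" into a set of CPU numbers."""
    cpus = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpus_needing_hotplug(changed, boot_nohz_full):
    """CPUs in `changed` that were not reserved via nohz_full at boot."""
    return parse_cpu_list(changed) - parse_cpu_list(boot_nohz_full)
```

With nohz_full=4-7 on the kernel command line, a partition using CPUs
4-5 would need no hotplug cycle in this model, while one using CPUs 2-5
would still cycle CPUs 2 and 3.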

Patches 1-4 update the CPU isolation code at kernel/sched/isolation.c
to enable dynamic update of the lists of housekeeping CPUs.

Patch 5 introduces a new cpuhp_offline_cb() API for shutting down the
given set of CPUs, running the given callback method and then bringing
those CPUs back online again. This new API will block any incoming
hotplug events from interfering with this operation.

Patches 6-9 update the cpuset partition code to use the new cpuhp API
to shut down the affected CPUs, make changes to the housekeeping
cpumasks and then bring those CPUs online afterward.

Patch 10 works around an issue in the DL server code that blocks the
hotplug operation under certain configurations.

Patches 11-14 update the timer tick and related code to enable proper
updates to the set of CPUs requiring nohz_full dynticks support.

Patch 15 enables runtime modification to the set of isolated CPUs
requiring RCU NO-CB CPU support with minor changes to the RCU code.

Patches 16-18 include other miscellaneous updates to cpuset code and
documentation.

This patch series is applied on top of some other cpuset patches[2]
posted upstream recently.

[1] https://lore.kernel.org/lkml/20250806093855.86469-1-gmonaco@redhat.com/
[2] https://lore.kernel.org/lkml/20250806172430.1155133-1-longman@redhat.com/

Waiman Long (18):
  sched/isolation: Enable runtime update of housekeeping cpumasks
  sched/isolation: Call sched_tick_offload_init() when
    HK_FLAG_KERNEL_NOISE is first set
  sched/isolation: Use RCU to delay successive housekeeping cpumask
    updates
  sched/isolation: Add a debugfs file to dump housekeeping cpumasks
  cpu/hotplug: Add a new cpuhp_offline_cb() API
  cgroup/cpuset: Introduce a new top level isolcpus_update_mutex
  cgroup/cpuset: Allow overwriting HK_TYPE_DOMAIN housekeeping cpumask
  cgroup/cpuset: Use CPU hotplug to enable runtime nohz_full
    modification
  cgroup/cpuset: Revert "Include isolated cpuset CPUs in
    cpu_is_isolated() check"
  sched/core: Ignore DL BW deactivation error if in
    cpuhp_offline_cb_mode
  tick/nohz: Make nohz_full parameter optional
  tick/nohz: Introduce tick_nohz_full_update_cpus() to update
    tick_nohz_full_mask
  tick/nohz: Allow runtime changes in full dynticks CPUs
  tick: Pass timer tick job to an online HK CPU in tick_cpu_dying()
  cgroup/cpuset: Enable RCU NO-CB CPU offloading of newly isolated CPUs
  cgroup/cpuset: Don't set have_boot_nohz_full without any boot time
    nohz_full CPU
  cgroup/cpuset: Documentation updates & don't use CPU 0 for isolated
    partition
  cgroup/cpuset: Add pr_debug() statements for cpuhp_offline_cb() call

 Documentation/admin-guide/cgroup-v2.rst       |  33 +-
 .../admin-guide/kernel-parameters.txt         |  19 +-
 include/linux/context_tracking.h              |   8 +-
 include/linux/cpuhplock.h                     |   9 +
 include/linux/cpuset.h                        |   6 -
 include/linux/rcupdate.h                      |   2 +
 include/linux/sched/isolation.h               |   9 +-
 include/linux/tick.h                          |   2 +
 kernel/cgroup/cpuset.c                        | 344 ++++++++++++------
 kernel/context_tracking.c                     |  21 +-
 kernel/cpu.c                                  |  47 +++
 kernel/rcu/tree_nocb.h                        |   7 +-
 kernel/sched/core.c                           |   8 +-
 kernel/sched/debug.c                          |  32 ++
 kernel/sched/isolation.c                      | 151 +++++++-
 kernel/sched/sched.h                          |   2 +-
 kernel/time/tick-common.c                     |  15 +-
 kernel/time/tick-sched.c                      |  24 +-
 .../selftests/cgroup/test_cpuset_prs.sh       |  15 +-
 19 files changed, 583 insertions(+), 171 deletions(-)

-- 
2.50.0

