* [PATCH v3 00/13] Dynamic Housekeeping Management (DHM) via CPUSets
@ 2026-06-18 3:11 Jing Wu
2026-06-18 3:11 ` [PATCH v3 01/13] sched/isolation: Replace notifier chain with explicit callback interface Jing Wu
` (12 more replies)
0 siblings, 13 replies; 17+ messages in thread
From: Jing Wu @ 2026-06-18 3:11 UTC
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
Shuah Khan, Thomas Gleixner
Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
Qiliang Yuan
This series introduces Dynamic Housekeeping Management (DHM) to the Linux
kernel, enabling runtime reconfiguration of kernel-noise housekeeping
(nohz_full tick suppression, RCU NOCB offloading, and managed IRQ
migration) through the existing cgroup v2 cpuset isolated partition
mechanism — no new kernel ABI required.
When a cpuset partition is set to isolated mode, the CPUs in that
partition are removed from the kernel's global housekeeping masks. The
housekeeping subsystems (tick/nohz, RCU NOCB, genirq) react via explicit
registered callbacks, applying the new masks at runtime. Destroying the
partition restores the CPUs to all housekeeping masks.
The architecture uses a per-type callback table (struct housekeeping_cbs)
with pre_validate/apply hooks, replacing the previous notifier chain.
Housekeeping cpumask pointers are RCU-protected to allow lock-free readers
during updates.
Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
V2 -> V3:
- Replace notifier chain with explicit per-type callback interface
(struct housekeeping_cbs with .name, .pre_validate, .apply fields).
- RCU-protect all housekeeping cpumask pointers; readers must hold
rcu_read_lock(), and apply() callbacks read the mask through
housekeeping_cpumask_rcu() inside that read-side section.
- Drop 5 patches from v2: HK_TYPE enum separation (upstream aliases are
already correct), no-op timer/hrtimer patches, kthread dead code, and
workqueue double-update.
- Fix deadlock in rcu_hk_workfn(): remove cpus_read_lock() wrapper around
remove_cpu()/add_cpu() which take cpu_hotplug_lock write side.
- Fix UAF in rcu_hk_apply(): snapshot the housekeeping cpumask inside the
work function under rcu_read_lock(), not at apply() time where the old
pointer may be freed by synchronize_rcu() before the work runs.
- Fix tick apply(): snapshot housekeeping_cpumask_rcu() under
rcu_read_lock() as required by lockdep for runtime-mutable types.
- Activate context_tracking dynamically via ct_cpu_track_user() /
ct_cpu_untrack_user() in tick apply(), eliminating the dependency on
CONFIG_CONTEXT_TRACKING_USER_FORCE flagged by tglx.
- Fix genirq apply(): snapshot HK_TYPE_MANAGED_IRQ mask under
rcu_read_lock() before the IRQ iteration loop.
- Simplify cpuset noise_types to BIT(HK_TYPE_KERNEL_NOISE) |
BIT(HK_TYPE_MANAGED_IRQ), replacing the redundant per-alias bitmask.
- housekeeping_update_types(): always use cpu_possible_mask as base
for HK_TYPE_KERNEL_NOISE, so de-isolation restores the mask to all
possible CPUs rather than leaving it at its last non-trivial value.
- Initialize watchdog_cpumask from HK_TYPE_KERNEL_NOISE (not
HK_TYPE_TIMER) at boot; keep it in sync at runtime via a new
housekeeping_cbs callback.
- Add kernel-noise selftest to test_cpuset_prs.sh, including
cpu_in_cpulist() for correct cpulist range membership detection and
nohz_full sysfs verification when CONFIG_NO_HZ_FULL is active.
- Add RCU caller fixes: sched/core (HK_TYPE_KERNEL_NOISE) and
drivers/hv (HK_TYPE_MANAGED_IRQ) are required because those types
are updated at runtime; hrtimer (HK_TYPE_TIMER) and arm64/topology
(HK_TYPE_TICK) are defensive fixes.
- Reorder patches so all subsystem callbacks are registered before the
cpuset patch that triggers housekeeping_update_types().
V1 -> V2:
- Rebrand series from DHEI to DHM (Dynamic Housekeeping Management).
- Drop custom sysfs interface entirely.
- Integrate housekeeping control into cgroup v2 cpuset isolated partition
mechanism.
- Add SMT-aware isolation constraints to prevent splitting SMT siblings.
- Add comprehensive documentation and cgroup functional selftests.
- Refactor mask transition logic to use RCU-safe handover.
v2: https://lore.kernel.org/r/20260413-wujing-dhm-v2-0-06df21caba5d@gmail.com
v1: https://lore.kernel.org/all/20260325-dhei-v12-final-v1-0-919cca23cadf@gmail.com
---
Jing Wu (13):
sched/isolation: Replace notifier chain with explicit callback interface
sched/isolation: Add housekeeping_update_types() for kernel-noise masks
sched/isolation: RCU-protect all housekeeping cpumask readers
sched/isolation: Fix RCU protection for runtime-mutable cpumask callers
cpu/hotplug: Reserve CPUHP states for nohz_full and managed IRQ down-paths
tick/nohz, context_tracking: Prepare for runtime nohz_full updates
rcu/nocb: Add explicit housekeeping callback for runtime NOCB toggling
genirq: Add explicit housekeeping callback for managed IRQ migration
watchdog/lockup_detector: Register housekeeping callback for kernel-noise
sched: Guard sched_tick_start/stop against uninitialized tick_work_cpu
cgroup/cpuset: Extend isolated partition to trigger kernel-noise isolation
docs: cgroup-v2: Document kernel-noise isolation via isolated partitions
selftests/cgroup: Add kernel-noise isolation test to cpuset selftest
Documentation/admin-guide/cgroup-v2.rst | 8 +
arch/arm64/kernel/topology.c | 9 +-
drivers/hv/channel_mgmt.c | 50 +++--
include/linux/context_tracking.h | 1 +
include/linux/cpuhotplug.h | 2 +
include/linux/sched/isolation.h | 41 ++++
kernel/cgroup/cpuset.c | 23 +-
kernel/context_tracking.c | 23 +-
kernel/irq/manage.c | 86 ++++++++
kernel/rcu/tree.c | 104 +++++++++
kernel/sched/core.c | 7 +-
kernel/sched/isolation.c | 256 ++++++++++++++++++++--
kernel/time/hrtimer.c | 5 +-
kernel/time/tick-sched.c | 157 ++++++++++++-
kernel/watchdog.c | 56 ++++-
tools/testing/selftests/cgroup/test_cpuset_prs.sh | 204 ++++++++++++++++-
16 files changed, 968 insertions(+), 64 deletions(-)
---
base-commit: eb3f4b7426cfd2b79d65b7d37155480b32259a11
change-id: 20260408-wujing-dhm-8f43e2d49cd8
Best regards,
--
Jing Wu <realwujing@gmail.com>
* [PATCH v3 01/13] sched/isolation: Replace notifier chain with explicit callback interface
2026-06-18 3:11 [PATCH v3 00/13] Dynamic Housekeeping Management (DHM) via CPUSets Jing Wu
@ 2026-06-18 3:11 ` Jing Wu
2026-06-18 3:11 ` [PATCH v3 02/13] sched/isolation: Add housekeeping_update_types() for kernel-noise masks Jing Wu
` (11 subsequent siblings)
12 siblings, 0 replies; 17+ messages in thread
From: Jing Wu @ 2026-06-18 3:11 UTC
Replace the blocking notifier chain with an explicit per-type callback
table (struct housekeeping_cbs). Each subsystem registers callbacks
at initcall time; pre_validate() runs before the RCU pointer swap to
allow rejecting the update, and apply() runs after synchronize_rcu()
when the new mask is visible to readers.
The table is limited to HK_MAX_CBS (4) slots per type, sufficient for
the kernel-noise subsystems and avoiding unbounded dynamic allocation
in the update path. The interface provides deterministic callback
order and explicit registration, giving each subsystem maintainer clear
visibility into when and why its callback is invoked — unlike the
opaque priority-based dispatch of notifier chains.
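For illustration only (this is not part of the patch), the two-phase
dispatch described above can be modelled in userspace. The names
hk_cbs_model, model_register(), model_pre_validate(), model_apply() and
the demo_* callbacks are hypothetical, masks are plain unsigned long
bitmaps, and the model keeps a single table rather than one per
housekeeping type:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define HK_MAX_CBS 4	/* same per-type limit as the patch */

/* Userspace stand-in for struct housekeeping_cbs (assumption). */
struct hk_cbs_model {
	const char *name;
	int (*pre_validate)(unsigned long cur_mask, unsigned long next_mask);
	void (*apply)(void);
};

static struct hk_cbs_model *table[HK_MAX_CBS];
static int nr_cbs;

/* Append to the fixed-size table; a full table yields -ENOSPC. */
static int model_register(struct hk_cbs_model *cbs)
{
	if (!cbs)
		return -EINVAL;
	if (nr_cbs >= HK_MAX_CBS)
		return -ENOSPC;
	table[nr_cbs++] = cbs;
	return 0;
}

/* Phase 1: any callback may veto; the first error stops the walk. */
static int model_pre_validate(unsigned long cur, unsigned long next)
{
	for (int i = 0; i < nr_cbs; i++) {
		int ret;

		if (!table[i]->pre_validate)
			continue;
		ret = table[i]->pre_validate(cur, next);
		if (ret < 0)
			return ret;
	}
	return 0;
}

/* Phase 2: meant to run only after the new mask has been published. */
static void model_apply(void)
{
	for (int i = 0; i < nr_cbs; i++)
		if (table[i]->apply)
			table[i]->apply();
}

/* Hypothetical subsystem that refuses any mask dropping CPU 0. */
static int apply_calls;

static int demo_pre_validate(unsigned long cur, unsigned long next)
{
	(void)cur;
	return (next & 1UL) ? 0 : -EINVAL;
}

static void demo_apply(void)
{
	apply_calls++;
}

static struct hk_cbs_model demo_cbs = {
	.name = "demo",
	.pre_validate = demo_pre_validate,
	.apply = demo_apply,
};
```

Callbacks run in table order, so a subsystem registered first is also
validated and applied first. The fixed-size array keeps the update path
free of allocations, at the cost of rejecting a fifth registration.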
Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
include/linux/sched/isolation.h | 31 +++++++++++++++
kernel/sched/isolation.c | 87 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 118 insertions(+)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index cf0fd03dd7a24..f362876b3ebdf 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -46,6 +46,33 @@ extern bool housekeeping_test_cpu(int cpu, enum hk_type type);
extern int housekeeping_update(struct cpumask *isol_mask);
extern void __init housekeeping_init(void);
+/**
+ * struct housekeeping_cbs - Per-subsystem callbacks for housekeeping mask changes
+ * @name: Subsystem name for diagnostic messages
+ * @pre_validate: Run before RCU pointer swap. Return -EINVAL
+ * to reject the update.
+ * @apply: Run after synchronize_rcu(). Reconfigure subsystem
+ * state. The new mask is visible to readers.
+ *
+ * Subsystems register callbacks at initcall time. Callbacks are
+ * invoked in registration order when the corresponding housekeeping
+ * mask changes. Types not present in the update mask are
+ * skipped.
+ *
+ * This replaces the notifier-chain pattern with deterministic callback
+ * ordering.
+ */
+struct housekeeping_cbs {
+ const char *name;
+ int (*pre_validate)(enum hk_type type,
+ const struct cpumask *cur_mask,
+ const struct cpumask *new_mask);
+ void (*apply)(enum hk_type type);
+};
+
+int housekeeping_register_cbs(enum hk_type type, struct housekeeping_cbs *cbs);
+int housekeeping_unregister_cbs(enum hk_type type, struct housekeeping_cbs *cbs);
+
#else
static inline int housekeeping_any_cpu(enum hk_type type)
@@ -73,6 +100,10 @@ static inline bool housekeeping_test_cpu(int cpu, enum hk_type type)
static inline int housekeeping_update(struct cpumask *isol_mask) { return 0; }
static inline void housekeeping_init(void) { }
+static inline int housekeeping_register_cbs(enum hk_type type,
+ struct housekeeping_cbs *cbs) { return 0; }
+static inline int housekeeping_unregister_cbs(enum hk_type type,
+ struct housekeeping_cbs *cbs) { return 0; }
#endif /* CONFIG_CPU_ISOLATION */
static inline bool housekeeping_cpu(int cpu, enum hk_type type)
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index ef152d401fe20..aae4dff7fbfc8 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -28,6 +28,93 @@ struct housekeeping {
static struct housekeeping housekeeping;
+/*
+ * Maintain an explicit callback table indexed by housekeeping type.
+ * Invoke callbacks for affected types in deterministic order:
+ * pre_validate() before the RCU pointer swap, apply() after
+ * synchronize_rcu().
+ */
+#define HK_MAX_CBS 4
+
+static struct {
+ struct housekeeping_cbs *cbs[HK_MAX_CBS];
+ int nr;
+} housekeeping_cbs_table[HK_TYPE_MAX];
+
+/**
+ * housekeeping_register_cbs - Register explicit callbacks for a housekeeping type
+ * @type: Housekeeping type to register for
+ * @cbs: Callback structure containing pre_validate() and apply()
+ *
+ * Callbacks run in registration order when the mask for @type changes:
+ * pre_validate() before the RCU swap may reject the update; apply()
+ * after synchronize_rcu() reconfigures subsystem state.
+ *
+ * Return: 0 on success, -EINVAL if @type or @cbs is invalid,
+ * -ENOSPC if the per-type table is full.
+ */
+int housekeeping_register_cbs(enum hk_type type, struct housekeeping_cbs *cbs)
+{
+ if (type >= HK_TYPE_MAX || !cbs)
+ return -EINVAL;
+ if (housekeeping_cbs_table[type].nr >= HK_MAX_CBS)
+ return -ENOSPC;
+ housekeeping_cbs_table[type].cbs[housekeeping_cbs_table[type].nr++] = cbs;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(housekeeping_register_cbs);
+
+/**
+ * housekeeping_unregister_cbs - Remove previously registered callbacks
+ * @type: Housekeeping type
+ * @cbs: Callback structure to remove
+ *
+ * Return: 0 on success, -EINVAL if arguments are invalid,
+ * -ENOENT if @cbs was not registered.
+ */
+int housekeeping_unregister_cbs(enum hk_type type, struct housekeeping_cbs *cbs)
+{
+ int i;
+
+ if (type >= HK_TYPE_MAX || !cbs)
+ return -EINVAL;
+ for (i = 0; i < housekeeping_cbs_table[type].nr; i++) {
+ if (housekeeping_cbs_table[type].cbs[i] == cbs) {
+ housekeeping_cbs_table[type].cbs[i] =
+ housekeeping_cbs_table[type].cbs[--housekeeping_cbs_table[type].nr];
+ return 0;
+ }
+ }
+ return -ENOENT;
+}
+EXPORT_SYMBOL_GPL(housekeeping_unregister_cbs);
+
+static int housekeeping_pre_validate_cbs(enum hk_type type,
+ const struct cpumask *cur,
+ const struct cpumask *new)
+{
+ int i, ret;
+
+ for (i = 0; i < housekeeping_cbs_table[type].nr; i++) {
+ if (!housekeeping_cbs_table[type].cbs[i]->pre_validate)
+ continue;
+ ret = housekeeping_cbs_table[type].cbs[i]->pre_validate(type, cur, new);
+ if (ret < 0)
+ return ret;
+ }
+ return 0;
+}
+
+static void housekeeping_apply_cbs(enum hk_type type)
+{
+ int i;
+
+ for (i = 0; i < housekeeping_cbs_table[type].nr; i++) {
+ if (housekeeping_cbs_table[type].cbs[i]->apply)
+ housekeeping_cbs_table[type].cbs[i]->apply(type);
+ }
+}
+
bool housekeeping_enabled(enum hk_type type)
{
return !!(READ_ONCE(housekeeping.flags) & BIT(type));
--
2.43.0
* [PATCH v3 02/13] sched/isolation: Add housekeeping_update_types() for kernel-noise masks
2026-06-18 3:11 [PATCH v3 00/13] Dynamic Housekeeping Management (DHM) via CPUSets Jing Wu
2026-06-18 3:11 ` [PATCH v3 01/13] sched/isolation: Replace notifier chain with explicit callback interface Jing Wu
@ 2026-06-18 3:11 ` Jing Wu
2026-06-18 3:11 ` [PATCH v3 03/13] sched/isolation: RCU-protect all housekeeping cpumask readers Jing Wu
` (10 subsequent siblings)
12 siblings, 0 replies; 17+ messages in thread
From: Jing Wu @ 2026-06-18 3:11 UTC
Introduce housekeeping_update_types(), which updates the cpumask for
each specified housekeeping type atomically using an RCU pointer swap.
For each type in @type_mask the trial mask is computed as
(base & ~isol_mask), where the base depends on the type:
- Most types use the current housekeeping cpumask as base. For
types that are only set at boot this is equivalent to the boot
mask, so trial = (boot_mask & ~isol_mask).
- HK_TYPE_KERNEL_NOISE always uses cpu_possible_mask as base. Its
semantics are "all possible CPUs minus the currently-isolated set";
using the current HK mask instead would leave it stuck at its last
non-trivial value after de-isolation, breaking subsequent isolation
cycles.
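A small worked example of the two base rules, as a userspace sketch
rather than kernel code. It assumes a toy system with 8 possible CPUs
and one bit per CPU; trial_mask(), update_from_current() and
update_from_possible() are hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* One bit per CPU; 8 possible CPUs in this toy system (assumption). */
#define POSSIBLE_MASK UINT64_C(0xFF)

/* trial = base & ~isol, as described in the text above. */
static uint64_t trial_mask(uint64_t base, uint64_t isol)
{
	return base & ~isol;
}

/* Rule for most types: the base is the current housekeeping mask. */
static uint64_t update_from_current(uint64_t cur_hk, uint64_t isol)
{
	return trial_mask(cur_hk, isol);
}

/* Rule for HK_TYPE_KERNEL_NOISE: the base is always the possible mask. */
static uint64_t update_from_possible(uint64_t isol)
{
	return trial_mask(POSSIBLE_MASK, isol);
}
```

Isolating CPUs 2-3 (isol = 0x0C) gives 0xF3 under both rules. After
de-isolation (isol = 0), update_from_current(0xF3, 0) stays at 0xF3,
while update_from_possible(0) returns to 0xFF.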
HK_TYPE_KERNEL_NOISE also supports runtime first-enable: if it was not
registered at boot (no nohz_full= on the kernel command line),
housekeeping_update_types() registers it in housekeeping.flags on the
first call. All other types must already be boot-enabled.
For each type the function validates the trial mask against
cpu_online_mask and runs the registered pre_validate() callbacks (which
may reject the update). It then swaps all RCU cpumask pointers in a
single pass, calls synchronize_rcu(), runs the apply() callbacks, and
finally frees the old masks.
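The validate-everything-then-swap structure can be sketched in
userspace as follows. This is a simplified model under stated
assumptions: NR_TYPES, hk_masks[], online_mask and update_types() are
hypothetical names, masks are 64-bit bitmaps, and the callbacks, the
RCU grace period and the HK_TYPE_KERNEL_NOISE base rule are left out:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define NR_TYPES 3	/* hypothetical number of housekeeping types */

static uint64_t hk_masks[NR_TYPES];	/* published masks, one bit per CPU */
static uint64_t online_mask;		/* stand-in for cpu_online_mask */

/*
 * Two-pass update: build and validate every trial mask first, then
 * publish them all. A failure in pass 1 leaves hk_masks[] untouched.
 */
static int update_types(unsigned long type_mask, uint64_t isol)
{
	uint64_t trials[NR_TYPES] = { 0 };

	for (int t = 0; t < NR_TYPES; t++) {
		if (!(type_mask & (1UL << t)))
			continue;
		trials[t] = hk_masks[t] & ~isol;
		/* Reject a mask that keeps no online CPU for housekeeping. */
		if (!(trials[t] & online_mask))
			return -EINVAL;
	}

	for (int t = 0; t < NR_TYPES; t++)
		if (type_mask & (1UL << t))
			hk_masks[t] = trials[t];

	return 0;
}
```

Because nothing is published until every selected type has passed
validation, a rejected request cannot leave some types updated and
others not.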
The existing housekeeping_update() continues to update only
HK_TYPE_DOMAIN and remains the entry point for the cpuset partition
path. housekeeping_update_types() enables the partition path to also
drive the kernel-noise types (HK_TYPE_KERNEL_NOISE,
HK_TYPE_MANAGED_IRQ) through the explicit callback interface added in
the previous patch.
Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
include/linux/sched/isolation.h | 4 ++
kernel/sched/isolation.c | 112 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 116 insertions(+)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index f362876b3ebdf..eecbcbe802bd0 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -44,6 +44,8 @@ extern bool housekeeping_enabled(enum hk_type type);
extern void housekeeping_affine(struct task_struct *t, enum hk_type type);
extern bool housekeeping_test_cpu(int cpu, enum hk_type type);
extern int housekeeping_update(struct cpumask *isol_mask);
+extern int housekeeping_update_types(unsigned long type_mask,
+ struct cpumask *isol_mask);
extern void __init housekeeping_init(void);
/**
@@ -99,6 +101,8 @@ static inline bool housekeeping_test_cpu(int cpu, enum hk_type type)
}
static inline int housekeeping_update(struct cpumask *isol_mask) { return 0; }
+static inline int housekeeping_update_types(unsigned long type_mask,
+ struct cpumask *isol_mask) { return 0; }
static inline void housekeeping_init(void) { }
static inline int housekeeping_register_cbs(enum hk_type type,
struct housekeeping_cbs *cbs) { return 0; }
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index aae4dff7fbfc8..4eca18cc5e8ce 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -249,6 +249,118 @@ int housekeeping_update(struct cpumask *isol_mask)
return 0;
}
+/**
+ * housekeeping_update_types - Update housekeeping masks for specified types
+ * @type_mask: Bitmask of housekeeping types to update
+ * @isol_mask: CPUs being added to the isolation set
+ *
+ * For each enabled type in @type_mask, compute the trial mask as
+ * (current mask & ~@isol_mask), validate it against cpu_online_mask,
+ * invoke pre_validate() callbacks, swap the RCU mask pointer, and run
+ * apply() callbacks after synchronize_rcu().
+ *
+ * HK_TYPE_KERNEL_NOISE also supports runtime first-enable: when an
+ * isolated cpuset partition is created without nohz_full= at boot,
+ * cpu_possible_mask is used as the initial base and the type flag is
+ * set in housekeeping.flags on the first call.
+ *
+ * Return: 0 on success, -ENOMEM on allocation failure, -EINVAL if
+ * a trial mask has no online CPUs.
+ */
+int housekeeping_update_types(unsigned long type_mask,
+ struct cpumask *isol_mask)
+{
+ struct cpumask *trials[HK_TYPE_MAX] = {};
+ struct cpumask *old_masks[HK_TYPE_MAX] = {};
+ enum hk_type type;
+ int ret = 0;
+
+ for_each_set_bit(type, &type_mask, HK_TYPE_MAX) {
+ const struct cpumask *base;
+
+ if (type == HK_TYPE_DOMAIN_BOOT)
+ continue;
+ if (!housekeeping_enabled(type)) {
+ /*
+ * HK_TYPE_KERNEL_NOISE supports runtime first-enable
+ * for DHM isolated partitions created without nohz_full=
+ * at boot. All other types must be boot-enabled.
+ */
+ if (type != HK_TYPE_KERNEL_NOISE)
+ continue;
+ }
+
+ /*
+ * HK_TYPE_KERNEL_NOISE always uses cpu_possible_mask as its
+ * base. Its semantics are exactly "cpu_possible minus the
+ * currently-isolated set", so the base never shrinks across
+ * successive isolation/de-isolation cycles. If we used the
+ * current HK mask instead, de-isolating all partitions would
+ * leave the mask at its last non-trivial value rather than
+ * reverting to cpu_possible, breaking subsequent isolations.
+ */
+ if (type == HK_TYPE_KERNEL_NOISE)
+ base = cpu_possible_mask;
+ else
+ base = housekeeping_cpumask(type);
+ trials[type] = kmalloc(cpumask_size(), GFP_KERNEL);
+ if (!trials[type]) {
+ ret = -ENOMEM;
+ goto err_free;
+ }
+ cpumask_andnot(trials[type], base, isol_mask);
+ if (!cpumask_intersects(trials[type], cpu_online_mask)) {
+ ret = -EINVAL;
+ goto err_free;
+ }
+ }
+
+ if (!housekeeping.flags) {
+ ret = -EINVAL;
+ goto err_free;
+ }
+
+ for_each_set_bit(type, &type_mask, HK_TYPE_MAX) {
+ if (!trials[type])
+ continue;
+ ret = housekeeping_pre_validate_cbs(type,
+ housekeeping_cpumask(type),
+ trials[type]);
+ if (ret < 0)
+ goto err_free;
+ }
+
+ for_each_set_bit(type, &type_mask, HK_TYPE_MAX) {
+ if (!trials[type])
+ continue;
+ old_masks[type] = housekeeping_cpumask_dereference(type);
+ /* First-time runtime enable: register the type now. */
+ if (!housekeeping_enabled(type))
+ WRITE_ONCE(housekeeping.flags,
+ housekeeping.flags | BIT(type));
+ rcu_assign_pointer(housekeeping.cpumasks[type], trials[type]);
+ trials[type] = NULL;
+ }
+
+ synchronize_rcu();
+
+ for_each_set_bit(type, &type_mask, HK_TYPE_MAX) {
+ if (housekeeping_cbs_table[type].nr == 0)
+ continue;
+ housekeeping_apply_cbs(type);
+ }
+
+ for_each_set_bit(type, &type_mask, HK_TYPE_MAX)
+ kfree(old_masks[type]);
+
+ return 0;
+
+err_free:
+ for_each_set_bit(type, &type_mask, HK_TYPE_MAX)
+ kfree(trials[type]);
+ return ret;
+}
+
void __init housekeeping_init(void)
{
enum hk_type type;
--
2.43.0
* [PATCH v3 03/13] sched/isolation: RCU-protect all housekeeping cpumask readers
2026-06-18 3:11 [PATCH v3 00/13] Dynamic Housekeeping Management (DHM) via CPUSets Jing Wu
2026-06-18 3:11 ` [PATCH v3 01/13] sched/isolation: Replace notifier chain with explicit callback interface Jing Wu
2026-06-18 3:11 ` [PATCH v3 02/13] sched/isolation: Add housekeeping_update_types() for kernel-noise masks Jing Wu
@ 2026-06-18 3:11 ` Jing Wu
2026-06-18 3:11 ` [PATCH v3 04/13] sched/isolation: Fix RCU protection for runtime-mutable cpumask callers Jing Wu
` (9 subsequent siblings)
12 siblings, 0 replies; 17+ messages in thread
From: Jing Wu @ 2026-06-18 3:11 UTC
Extend housekeeping_dereference_check() to validate all runtime-mutable
types (HK_TYPE_DOMAIN, HK_TYPE_KERNEL_NOISE, HK_TYPE_MANAGED_IRQ), not
only HK_TYPE_DOMAIN. Boot-only types (HK_TYPE_DOMAIN_BOOT) remain
unchecked.
Add housekeeping_cpumask_rcu() for callers that already hold an RCU
read lock. This variant uses plain rcu_dereference() and skips the
housekeeping_dereference_check() condition, so it does not raise
false-positive lockdep warnings inside RCU read-side critical sections.
Use READ_ONCE() consistently when testing housekeeping.flags in paths
that may race with housekeeping_update_types().
Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
include/linux/sched/isolation.h | 6 +++++
kernel/sched/isolation.c | 57 +++++++++++++++++++++++++++++++----------
2 files changed, 49 insertions(+), 14 deletions(-)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index eecbcbe802bd0..ed6e1c6980131 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -40,6 +40,7 @@ enum hk_type {
DECLARE_STATIC_KEY_FALSE(housekeeping_overridden);
extern int housekeeping_any_cpu(enum hk_type type);
extern const struct cpumask *housekeeping_cpumask(enum hk_type type);
+extern const struct cpumask *housekeeping_cpumask_rcu(enum hk_type type);
extern bool housekeeping_enabled(enum hk_type type);
extern void housekeeping_affine(struct task_struct *t, enum hk_type type);
extern bool housekeeping_test_cpu(int cpu, enum hk_type type);
@@ -87,6 +88,11 @@ static inline const struct cpumask *housekeeping_cpumask(enum hk_type type)
return cpu_possible_mask;
}
+static inline const struct cpumask *housekeeping_cpumask_rcu(enum hk_type type)
+{
+ return cpu_possible_mask;
+}
+
static inline bool housekeeping_enabled(enum hk_type type)
{
return false;
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 4eca18cc5e8ce..3d5d3f12853c7 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -121,25 +121,40 @@ bool housekeeping_enabled(enum hk_type type)
}
EXPORT_SYMBOL_GPL(housekeeping_enabled);
+/*
+ * Types that can change at runtime via cpuset isolated partitions.
+ * Boot-only types (DOMAIN_BOOT) are always safe to read without lockdep.
+ */
+static bool housekeeping_type_can_change(enum hk_type type)
+{
+ switch (type) {
+ case HK_TYPE_DOMAIN:
+ case HK_TYPE_KERNEL_NOISE:
+ case HK_TYPE_MANAGED_IRQ:
+ return true;
+ default:
+ return false;
+ }
+}
+
static bool housekeeping_dereference_check(enum hk_type type)
{
- if (IS_ENABLED(CONFIG_LOCKDEP) && type == HK_TYPE_DOMAIN) {
- /* Cpuset isn't even writable yet? */
- if (system_state <= SYSTEM_SCHEDULING)
- return true;
+ if (!IS_ENABLED(CONFIG_LOCKDEP) || !housekeeping_type_can_change(type))
+ return true;
- /* CPU hotplug write locked, so cpuset partition can't be overwritten */
- if (IS_ENABLED(CONFIG_HOTPLUG_CPU) && lockdep_is_cpus_write_held())
- return true;
+ /* Cpuset isn't even writable yet? */
+ if (system_state <= SYSTEM_SCHEDULING)
+ return true;
- /* Cpuset lock held, partitions not writable */
- if (IS_ENABLED(CONFIG_CPUSETS) && lockdep_is_cpuset_held())
- return true;
+ /* CPU hotplug write locked, so cpuset partition can't be overwritten */
+ if (IS_ENABLED(CONFIG_HOTPLUG_CPU) && lockdep_is_cpus_write_held())
+ return true;
- return false;
- }
+ /* Cpuset lock held, partitions not writable */
+ if (IS_ENABLED(CONFIG_CPUSETS) && lockdep_is_cpuset_held())
+ return true;
- return true;
+ return false;
}
static inline struct cpumask *housekeeping_cpumask_dereference(enum hk_type type)
@@ -162,12 +177,26 @@ const struct cpumask *housekeeping_cpumask(enum hk_type type)
}
EXPORT_SYMBOL_GPL(housekeeping_cpumask);
+const struct cpumask *housekeeping_cpumask_rcu(enum hk_type type)
+{
+ const struct cpumask *mask = NULL;
+
+ if (static_branch_unlikely(&housekeeping_overridden)) {
+ if (READ_ONCE(housekeeping.flags) & BIT(type))
+ mask = rcu_dereference(housekeeping.cpumasks[type]);
+ }
+ if (!mask)
+ mask = cpu_possible_mask;
+ return mask;
+}
+EXPORT_SYMBOL_GPL(housekeeping_cpumask_rcu);
+
int housekeeping_any_cpu(enum hk_type type)
{
int cpu;
if (static_branch_unlikely(&housekeeping_overridden)) {
- if (housekeeping.flags & BIT(type)) {
+ if (READ_ONCE(housekeeping.flags) & BIT(type)) {
cpu = sched_numa_find_closest(housekeeping_cpumask(type), smp_processor_id());
if (cpu < nr_cpu_ids)
return cpu;
--
2.43.0
* [PATCH v3 04/13] sched/isolation: Fix RCU protection for runtime-mutable cpumask callers
2026-06-18 3:11 [PATCH v3 00/13] Dynamic Housekeeping Management (DHM) via CPUSets Jing Wu
` (2 preceding siblings ...)
2026-06-18 3:11 ` [PATCH v3 03/13] sched/isolation: RCU-protect all housekeeping cpumask readers Jing Wu
@ 2026-06-18 3:11 ` Jing Wu
2026-06-18 3:11 ` [PATCH v3 05/13] cpu/hotplug: Reserve CPUHP states for nohz_full and managed IRQ down-paths Jing Wu
` (8 subsequent siblings)
12 siblings, 0 replies; 17+ messages in thread
From: Jing Wu @ 2026-06-18 3:11 UTC
housekeeping_update_types() installs new cpumasks via rcu_assign_pointer()
and frees the old ones after synchronize_rcu(); callers that dereference
the old pointer without holding an RCU read lock can access freed memory.
Fix the four call sites:
kernel/sched/core.c (get_nohz_timer_target, HK_TYPE_KERNEL_NOISE):
The guard(rcu)() was acquired after housekeeping_cpumask(). Move it
before the call and switch to housekeeping_cpumask_rcu() so hk_mask
is read inside the RCU read-side critical section. HK_TYPE_KERNEL_NOISE
is updated at runtime by housekeeping_update_types(); this fix is
required for correctness.
drivers/hv/channel_mgmt.c (init_vp_index, HK_TYPE_MANAGED_IRQ):
The function stored the raw pointer in a local variable and used it
across GFP_KERNEL allocations (which can sleep, so an RCU read lock
cannot span them). Allocate both cpumask_var_t buffers first, then
snapshot the housekeeping mask under a brief rcu_read_lock() and use
the snapshot throughout. HK_TYPE_MANAGED_IRQ is updated at runtime;
this fix is required for correctness.
kernel/time/hrtimer.c (get_target_base, HK_TYPE_TIMER):
cpumask_any_and() against housekeeping_cpumask(HK_TYPE_TIMER) was
called without any lock. Wrap with rcu_read_lock()/rcu_read_unlock()
and use housekeeping_cpumask_rcu(). HK_TYPE_TIMER is not changed at
runtime in this series; this is a defensive fix to satisfy the
housekeeping_dereference_check() lockdep annotation for future-proofing.
hrtimers_cpu_dying() is already safe: it runs under the cpu_hotplug_lock
write side, which housekeeping_dereference_check() already permits.
arch/arm64/kernel/topology.c (arch_freq_get_on_cpu, HK_TYPE_TICK):
cpumask_intersects() against housekeeping_cpumask(HK_TYPE_TICK) was
called without any lock. Evaluate under rcu_read_lock() and store
the boolean result before releasing the lock. HK_TYPE_TICK is not
changed at runtime in this series; this is a defensive fix.
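The snapshot pattern used at the drivers/hv and arm64 call sites can be
illustrated with a userspace toy. This is NOT real RCU: read_depth,
model_read_lock(), model_read_unlock(), model_mask_deref(),
snapshot_mask() and model_publish() are hypothetical stand-ins that only
show the access rule (dereference inside the read-side section, keep a
copy for use afterwards):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-ins: not real RCU, just enough to show the access rule. */
static int read_depth;			/* > 0 inside a "read-side section" */
static uint64_t *published_mask;	/* pointer an updater may swap and free */

static void model_read_lock(void)   { read_depth++; }
static void model_read_unlock(void) { read_depth--; }

/* Dereferencing is only legal inside the read-side section. */
static const uint64_t *model_mask_deref(void)
{
	assert(read_depth > 0);
	return published_mask;
}

/* Copy the mask while protected; the caller may sleep afterwards. */
static uint64_t snapshot_mask(void)
{
	uint64_t snap;

	model_read_lock();
	snap = *model_mask_deref();
	model_read_unlock();
	return snap;
}

/* Updater: publish a new mask, then free the old one. */
static int model_publish(uint64_t mask)
{
	uint64_t *fresh = malloc(sizeof(*fresh));
	uint64_t *old = published_mask;

	if (!fresh)
		return -1;
	*fresh = mask;
	published_mask = fresh;
	/* A real updater would wait for readers here (synchronize_rcu()). */
	free(old);
	return 0;
}
```

A caller that takes snapshot_mask() before an update keeps a valid
private copy even after model_publish() has freed the old buffer,
whereas a raw pointer saved outside the read-side section would dangle.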
Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
arch/arm64/kernel/topology.c | 9 ++++++--
drivers/hv/channel_mgmt.c | 50 ++++++++++++++++++++++++++++++--------------
kernel/sched/core.c | 3 +--
kernel/time/hrtimer.c | 5 ++++-
4 files changed, 46 insertions(+), 21 deletions(-)
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index b32f13358fbb1..8f4329b57cea7 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -212,8 +212,13 @@ int arch_freq_get_on_cpu(int cpu)
if (!policy)
return -EINVAL;
- if (!cpumask_intersects(policy->related_cpus,
- housekeeping_cpumask(HK_TYPE_TICK))) {
+ bool no_hk_in_policy;
+
+ rcu_read_lock();
+ no_hk_in_policy = !cpumask_intersects(policy->related_cpus,
+ housekeeping_cpumask_rcu(HK_TYPE_TICK));
+ rcu_read_unlock();
+ if (no_hk_in_policy) {
cpufreq_cpu_put(policy);
return -EOPNOTSUPP;
}
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 84eb0a6a0b546..fc5247e92e1b3 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -750,26 +750,43 @@ static void init_vp_index(struct vmbus_channel *channel)
{
bool perf_chn = hv_is_perf_channel(channel);
u32 i, ncpu = num_online_cpus();
- cpumask_var_t available_mask;
+ cpumask_var_t available_mask, hk_snap;
struct cpumask *allocated_mask;
- const struct cpumask *hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
u32 target_cpu;
int numa_node;
- if (!perf_chn ||
- !alloc_cpumask_var(&available_mask, GFP_KERNEL) ||
- cpumask_empty(hk_mask)) {
- /*
- * If the channel is not a performance critical
- * channel, bind it to VMBUS_CONNECT_CPU.
- * In case alloc_cpumask_var() fails, bind it to
- * VMBUS_CONNECT_CPU.
- * If all the cpus are isolated, bind it to
- * VMBUS_CONNECT_CPU.
- */
+ if (!perf_chn) {
+ channel->target_cpu = VMBUS_CONNECT_CPU;
+ return;
+ }
+
+ if (!alloc_cpumask_var(&available_mask, GFP_KERNEL)) {
+ channel->target_cpu = VMBUS_CONNECT_CPU;
+ hv_set_allocated_cpu(VMBUS_CONNECT_CPU);
+ return;
+ }
+
+ /*
+ * Snapshot HK_TYPE_MANAGED_IRQ cpumask under RCU read lock.
+ * housekeeping_update_types() frees the old cpumask after
+ * synchronize_rcu(), so we must not hold the pointer beyond an
+ * RCU read-side critical section.
+ */
+ if (!alloc_cpumask_var(&hk_snap, GFP_KERNEL)) {
+ free_cpumask_var(available_mask);
+ channel->target_cpu = VMBUS_CONNECT_CPU;
+ hv_set_allocated_cpu(VMBUS_CONNECT_CPU);
+ return;
+ }
+ rcu_read_lock();
+ cpumask_copy(hk_snap, housekeeping_cpumask_rcu(HK_TYPE_MANAGED_IRQ));
+ rcu_read_unlock();
+
+ if (cpumask_empty(hk_snap)) {
+ free_cpumask_var(hk_snap);
+ free_cpumask_var(available_mask);
channel->target_cpu = VMBUS_CONNECT_CPU;
- if (perf_chn)
- hv_set_allocated_cpu(VMBUS_CONNECT_CPU);
+ hv_set_allocated_cpu(VMBUS_CONNECT_CPU);
return;
}
@@ -788,7 +805,7 @@ static void init_vp_index(struct vmbus_channel *channel)
retry:
cpumask_xor(available_mask, allocated_mask, cpumask_of_node(numa_node));
- cpumask_and(available_mask, available_mask, hk_mask);
+ cpumask_and(available_mask, available_mask, hk_snap);
if (cpumask_empty(available_mask)) {
/*
@@ -809,6 +826,7 @@ static void init_vp_index(struct vmbus_channel *channel)
channel->target_cpu = target_cpu;
+ free_cpumask_var(hk_snap);
free_cpumask_var(available_mask);
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b8871449d3c69..371b509d92164 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1272,9 +1272,8 @@ int get_nohz_timer_target(void)
default_cpu = cpu;
}
- hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
-
guard(rcu)();
+ hk_mask = housekeeping_cpumask_rcu(HK_TYPE_KERNEL_NOISE);
for_each_domain(cpu, sd) {
for_each_cpu_and(i, sched_domain_span(sd), hk_mask) {
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 5bd6efe598f0f..18e17a9dad67b 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -242,8 +242,11 @@ static bool hrtimer_suitable_target(struct hrtimer *timer, struct hrtimer_clock_
static inline struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base, bool pinned)
{
if (!hrtimer_base_is_online(base)) {
- int cpu = cpumask_any_and(cpu_online_mask, housekeeping_cpumask(HK_TYPE_TIMER));
+ int cpu;
+ rcu_read_lock();
+ cpu = cpumask_any_and(cpu_online_mask, housekeeping_cpumask_rcu(HK_TYPE_TIMER));
+ rcu_read_unlock();
return &per_cpu(hrtimer_bases, cpu);
}
--
2.43.0
* [PATCH v3 05/13] cpu/hotplug: Reserve CPUHP states for nohz_full and managed IRQ down-paths
2026-06-18 3:11 [PATCH v3 00/13] Dynamic Housekeeping Management (DHM) via CPUSets Jing Wu
` (3 preceding siblings ...)
2026-06-18 3:11 ` [PATCH v3 04/13] sched/isolation: Fix RCU protection for runtime-mutable cpumask callers Jing Wu
@ 2026-06-18 3:11 ` Jing Wu
2026-06-18 16:06 ` Thomas Gleixner
2026-06-18 3:11 ` [PATCH v3 06/13] tick/nohz, context_tracking: Prepare for runtime nohz_full updates Jing Wu
` (7 subsequent siblings)
12 siblings, 1 reply; 17+ messages in thread
From: Jing Wu @ 2026-06-18 3:11 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
Shuah Khan, Thomas Gleixner
Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
Qiliang Yuan
Add CPUHP_AP_NO_HZ_FULL_DYING and CPUHP_AP_IRQ_AFFINITY_DYING to the
cpuhp_state enum. These dying callbacks are invoked during CPU offline
before the tick is stopped, enabling clean tick handover and managed
IRQ migration when a CPU transitions between isolated and housekeeping
states.
The existing CPUHP_AP_IRQ_AFFINITY_ONLINE already handles managed IRQ
restoration on CPU online. The new dying callback completes the pair,
migrating managed interrupts away from the CPU before it goes down.
Subsequent patches register handlers for these states.
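
As a rough illustration of why the two new states are placed after
CPUHP_AP_TICK_DYING in the enum: the sketch below assumes that the hotplug
core runs teardown callbacks in descending state order, so entries declared
later are torn down earlier. The demo_* names are hypothetical and model
only a four-entry excerpt of the DYING section, not the real cpuhp_state
enum.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical excerpt of the DYING section, in the order used by this patch. */
enum demo_state {
	DEMO_HRTIMERS_DYING,
	DEMO_TICK_DYING,
	DEMO_IRQ_AFFINITY_DYING,
	DEMO_NO_HZ_FULL_DYING,
	DEMO_NR_STATES,
};

/*
 * Fill @order with the sequence in which teardown callbacks would run
 * under the stated assumption (highest state first). Returns the number
 * of entries written.
 */
static size_t demo_teardown_order(enum demo_state order[DEMO_NR_STATES])
{
	size_t n = 0;

	assert(order);
	for (int s = DEMO_NR_STATES - 1; s >= 0; s--)
		order[n++] = (enum demo_state)s;
	return n;
}
```

Under that assumption the NO_HZ_FULL and IRQ_AFFINITY dying callbacks both
run before the TICK_DYING teardown, which matches the "before the tick is
stopped" wording in the changelog.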
Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
include/linux/cpuhotplug.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 22ba327ec2278..075cfa8161334 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -186,6 +186,8 @@ enum cpuhp_state {
CPUHP_AP_SMPCFD_DYING,
CPUHP_AP_HRTIMERS_DYING,
CPUHP_AP_TICK_DYING,
+ CPUHP_AP_IRQ_AFFINITY_DYING,
+ CPUHP_AP_NO_HZ_FULL_DYING,
CPUHP_AP_X86_TBOOT_DYING,
CPUHP_AP_ARM_CACHE_B15_RAC_DYING,
CPUHP_AP_ONLINE,
--
2.43.0
* [PATCH v3 06/13] tick/nohz, context_tracking: Prepare for runtime nohz_full updates
2026-06-18 3:11 [PATCH v3 00/13] Dynamic Housekeeping Management (DHM) via CPUSets Jing Wu
` (4 preceding siblings ...)
2026-06-18 3:11 ` [PATCH v3 05/13] cpu/hotplug: Reserve CPUHP states for nohz_full and managed IRQ down-paths Jing Wu
@ 2026-06-18 3:11 ` Jing Wu
2026-06-18 17:27 ` Thomas Gleixner
2026-06-18 3:11 ` [PATCH v3 07/13] rcu/nocb: Add explicit housekeeping callback for runtime NOCB toggling Jing Wu
` (6 subsequent siblings)
12 siblings, 1 reply; 17+ messages in thread
From: Jing Wu @ 2026-06-18 3:11 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
Shuah Khan, Thomas Gleixner
Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
Qiliang Yuan
Remove __init from ct_cpu_track_user() and drop its one-shot
"initialized" flag, together with the boot-only TIF_NOHZ setup that the
flag guarded, so context tracking can be activated on CPUs that join
nohz_full at runtime. Drop the __ro_after_init attribute from the
context_tracking_key static key, allowing static_branch_dec() when a
CPU leaves nohz_full.
Add ct_cpu_untrack_user() to reverse ct_cpu_track_user(), decrementing
the static key and clearing the per-CPU tracking state.
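
The pairing of the two helpers can be sketched in userspace as follows.
This is a minimal model, not the kernel code: demo_active[] stands in for
the per-CPU context_tracking.active flag and demo_key_count for the enable
count of context_tracking_key; both names and DEMO_NR_CPUS are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NR_CPUS 8

static bool demo_active[DEMO_NR_CPUS]; /* models per-CPU context_tracking.active */
static int demo_key_count;             /* models the static key's enable count */

/* Idempotent: a CPU that is already tracked does not bump the count again. */
static void demo_track_user(int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	if (!demo_active[cpu]) {
		demo_active[cpu] = true;
		demo_key_count++;
	}
}

/* Idempotent: untracking a CPU that is not tracked leaves the count alone. */
static void demo_untrack_user(int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	if (!demo_active[cpu])
		return;
	demo_active[cpu] = false;
	demo_key_count--;
}
```

Because each helper checks the per-CPU flag first, repeated calls for the
same CPU keep the count balanced, which is what lets the key drop back to
zero once the last CPU leaves nohz_full.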
Register a housekeeping_cbs for HK_TYPE_KERNEL_NOISE that:
- pre_validate: checks CONFIG_NO_HZ_FULL is available.
- apply: snapshots the new HK_TYPE_KERNEL_NOISE mask under an RCU
read lock (the lockdep annotation in housekeeping_cpumask() requires
this even after synchronize_rcu() completes), computes nohz_full as
the complement of the housekeeping mask, then under tick_nohz_lock:
- When nohz_full= was active at boot, activates context tracking
(ct_cpu_track_user()) on CPUs newly added to nohz_full, and
deactivates it (ct_cpu_untrack_user()) on CPUs returning to the
housekeeping set. This activates the context_tracking_key static
key dynamically, eliminating the need for
CONFIG_CONTEXT_TRACKING_USER_FORCE.
- Updates tick_nohz_full_mask in-place (legacy EXPORT_SYMBOL_GPL
snapshot, eventually consistent). This step is unconditional.
- Migrates tick_do_timer_cpu if it moved into the isolated set
(only when nohz_full= was active at boot).
- Kicks all CPUs to re-evaluate tick behaviour (likewise only when
nohz_full= was active at boot).
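
The mask arithmetic in the apply step can be sketched with plain bitmasks.
This is a userspace illustration only: mask_t (one bit per CPU, at most 64
CPUs), struct nohz_delta and nohz_compute_delta() are hypothetical names
that mirror the three cpumask_andnot() calls in tick_nohz_hk_apply().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: one bit per CPU, at most 64 CPUs. */
typedef uint64_t mask_t;

struct nohz_delta {
	mask_t nohz_full; /* new isolated set: possible CPUs not in housekeeping */
	mask_t added;     /* CPUs joining nohz_full */
	mask_t removed;   /* CPUs returning to the housekeeping set */
};

static struct nohz_delta nohz_compute_delta(mask_t possible, mask_t hk,
					    mask_t old_nohz_full)
{
	struct nohz_delta d;

	d.nohz_full = possible & ~hk;             /* complement of housekeeping */
	d.added = d.nohz_full & ~old_nohz_full;   /* in the new set, not the old */
	d.removed = old_nohz_full & ~d.nohz_full; /* in the old set, not the new */

	/* The two delta sets are disjoint by construction. */
	assert((d.added & d.removed) == 0);
	return d;
}
```

For example, with eight possible CPUs (0xFF), housekeeping CPUs 0-3 (0x0F)
and a previous nohz_full set of CPUs 4-5 (0x30), the new set is CPUs 4-7,
CPUs 6-7 are added and nothing is removed.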
When CONFIG_CONTEXT_TRACKING_USER_FORCE is enabled and nohz_full= is
given at boot, tick_nohz_init() now calls context_tracking_init()
before iterating over tick_nohz_full_mask to call ct_cpu_track_user().
This ensures the per-CPU tracking state is set up before any CPU is
tracked, which is also required for CPUs later added to nohz_full at
runtime via DHM isolated partitions.
Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
include/linux/context_tracking.h | 1 +
kernel/context_tracking.c | 23 ++----
kernel/time/tick-sched.c | 157 +++++++++++++++++++++++++++++++++++++--
3 files changed, 161 insertions(+), 20 deletions(-)
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index af9fe87a09225..632cfc97b5b22 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -12,6 +12,7 @@
#ifdef CONFIG_CONTEXT_TRACKING_USER
extern void ct_cpu_track_user(int cpu);
+extern void ct_cpu_untrack_user(int cpu);
/* Called with interrupts disabled. */
extern void __ct_user_enter(enum ctx_state state);
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index a743e7ffa6c00..e68fb02b25ad4 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -411,7 +411,7 @@ static __always_inline void ct_kernel_enter(bool user, int offset) { }
#define CREATE_TRACE_POINTS
#include <trace/events/context_tracking.h>
-DEFINE_STATIC_KEY_FALSE_RO(context_tracking_key);
+DEFINE_STATIC_KEY_FALSE(context_tracking_key);
EXPORT_SYMBOL_GPL(context_tracking_key);
static noinstr bool context_tracking_recursion_enter(void)
@@ -674,28 +674,21 @@ void user_exit_callable(void)
}
NOKPROBE_SYMBOL(user_exit_callable);
-void __init ct_cpu_track_user(int cpu)
+void ct_cpu_track_user(int cpu)
{
- static __initdata bool initialized = false;
-
if (!per_cpu(context_tracking.active, cpu)) {
per_cpu(context_tracking.active, cpu) = true;
static_branch_inc(&context_tracking_key);
}
+}
- if (initialized)
+void ct_cpu_untrack_user(int cpu)
+{
+ if (!per_cpu(context_tracking.active, cpu))
return;
-#ifdef CONFIG_HAVE_TIF_NOHZ
- /*
- * Set TIF_NOHZ to init/0 and let it propagate to all tasks through fork
- * This assumes that init is the only task at this early boot stage.
- */
- set_tsk_thread_flag(&init_task, TIF_NOHZ);
-#endif
- WARN_ON_ONCE(!tasklist_empty());
-
- initialized = true;
+ per_cpu(context_tracking.active, cpu) = false;
+ static_branch_dec(&context_tracking_key);
}
#ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index cbbb87a0c6e7c..a7fe097042f7d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -26,6 +26,7 @@
#include <linux/irq_work.h>
#include <linux/posix-timers.h>
#include <linux/context_tracking.h>
+#include <linux/sched/isolation.h>
#include <linux/mm.h>
#include <asm/irq_regs.h>
@@ -653,11 +654,6 @@ void __init tick_nohz_init(void)
if (!tick_nohz_full_running)
return;
- /*
- * Full dynticks uses IRQ work to drive the tick rescheduling on safe
- * locking contexts. But then we need IRQ work to raise its own
- * interrupts to avoid circular dependency on the tick.
- */
if (!arch_irq_work_has_interrupt()) {
pr_warn("NO_HZ: Can't run full dynticks because arch doesn't support IRQ work self-IPIs\n");
cpumask_clear(tick_nohz_full_mask);
@@ -676,6 +672,16 @@ void __init tick_nohz_init(void)
}
}
+ /*
+ * Pre-initialize context tracking for all possible CPUs so
+ * ctx tracking is already active when a CPU is later added to
+ * nohz_full at runtime. The tracking overhead is negligible
+ * because the static key is not incremented yet — only per-CPU
+ * tracking state is set up.
+ */
+ if (IS_ENABLED(CONFIG_CONTEXT_TRACKING_USER_FORCE))
+ context_tracking_init();
+
for_each_cpu(cpu, tick_nohz_full_mask)
ct_cpu_track_user(cpu);
@@ -686,6 +692,147 @@ void __init tick_nohz_init(void)
pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
cpumask_pr_args(tick_nohz_full_mask));
}
+
+static int tick_nohz_hk_validate(enum hk_type type,
+ const struct cpumask *cur_mask,
+ const struct cpumask *new_mask)
+{
+ if (!IS_ENABLED(CONFIG_NO_HZ_FULL))
+ return -EOPNOTSUPP;
+ return 0;
+}
+
+static void tick_nohz_hk_apply(enum hk_type type)
+{
+ static DEFINE_SPINLOCK(tick_nohz_lock);
+ cpumask_var_t nohz_full, added, removed;
+ bool was_running;
+ int cpu;
+
+ if (!alloc_cpumask_var(&nohz_full, GFP_KERNEL))
+ return;
+ if (!alloc_cpumask_var(&added, GFP_KERNEL)) {
+ free_cpumask_var(nohz_full);
+ return;
+ }
+ if (!alloc_cpumask_var(&removed, GFP_KERNEL)) {
+ free_cpumask_var(added);
+ free_cpumask_var(nohz_full);
+ return;
+ }
+
+ /*
+ * Snapshot the new HK_TYPE_KERNEL_NOISE mask under an RCU read lock.
+ * housekeeping_update_types() completes synchronize_rcu() before
+ * invoking apply(), so the new pointer is stable; however the lockdep
+ * annotation in housekeeping_cpumask() still requires an RCU read-side
+ * critical section for runtime-mutable types.
+ */
+ rcu_read_lock();
+ cpumask_andnot(nohz_full, cpu_possible_mask,
+ housekeeping_cpumask_rcu(HK_TYPE_KERNEL_NOISE));
+ rcu_read_unlock();
+
+ /*
+ * When "nohz_full=" was not passed at boot, tick_nohz_full_running is
+ * false and the full dynticks infrastructure (sched_tick_offload_init,
+ * RCU nohz quiescent-state reporting, context-tracking bootstrap) was
+ * never initialised. In that case restrict the update to
+ * tick_nohz_full_mask so the /sys/devices/system/cpu/nohz_full sysfs
+ * attribute reflects DHM-isolated CPUs without enabling tick
+ * suppression, context tracking, or timer migration – all of which
+ * require boot-time setup and would deadlock on the first
+ * synchronize_rcu() call after CPUs are offlined.
+ */
+ was_running = READ_ONCE(tick_nohz_full_running);
+
+ spin_lock(&tick_nohz_lock);
+
+ /*
+ * When nohz_full= was active at boot, compute the delta and update
+ * context tracking for CPUs joining or leaving the nohz_full set.
+ * Skip when !was_running: ct_cpu_track_user() calls
+ * static_branch_inc() which may sleep (jump_label_update on the
+ * 0→1 transition) – illegal inside a spinlock.
+ */
+ if (IS_ENABLED(CONFIG_CONTEXT_TRACKING_USER) &&
+ was_running &&
+ cpumask_available(tick_nohz_full_mask)) {
+ cpumask_andnot(added, nohz_full, tick_nohz_full_mask);
+ cpumask_andnot(removed, tick_nohz_full_mask, nohz_full);
+ for_each_cpu(cpu, added)
+ ct_cpu_track_user(cpu);
+ for_each_cpu(cpu, removed)
+ ct_cpu_untrack_user(cpu);
+ }
+
+ /*
+ * Update tick_nohz_full_mask unconditionally: this is the snapshot
+ * read by the /sys/devices/system/cpu/nohz_full sysfs attribute and
+ * must reflect the current isolation set even in the DHM runtime case.
+ */
+ if (cpumask_available(tick_nohz_full_mask))
+ cpumask_copy(tick_nohz_full_mask, nohz_full);
+
+ /*
+ * Only modify tick_nohz_full_running and migrate the global tick when
+ * nohz_full= was set at boot; without boot-time setup, setting
+ * tick_nohz_full_running would suppress ticks on isolated CPUs and
+ * prevent RCU quiescent-state reporting, causing synchronize_rcu()
+ * to stall permanently when a CPU is subsequently offlined.
+ */
+ if (was_running) {
+ tick_nohz_full_running = !cpumask_empty(nohz_full);
+
+ if (tick_nohz_full_running) {
+ cpu = READ_ONCE(tick_do_timer_cpu);
+ if (cpu < nr_cpu_ids &&
+ !housekeeping_test_cpu(cpu, HK_TYPE_KERNEL_NOISE)) {
+ int new_cpu;
+
+ new_cpu = housekeeping_any_cpu(HK_TYPE_KERNEL_NOISE);
+ if (new_cpu < nr_cpu_ids)
+ WRITE_ONCE(tick_do_timer_cpu, new_cpu);
+ }
+ }
+ }
+
+ spin_unlock(&tick_nohz_lock);
+
+ if (was_running)
+ tick_nohz_full_kick_all();
+ free_cpumask_var(removed);
+ free_cpumask_var(added);
+ free_cpumask_var(nohz_full);
+}
+
+static struct housekeeping_cbs tick_nohz_hk_cbs = {
+ .name = "tick/nohz",
+ .pre_validate = tick_nohz_hk_validate,
+ .apply = tick_nohz_hk_apply,
+};
+
+static int __init tick_nohz_hk_init_late(void)
+{
+ int ret;
+
+ /*
+ * Ensure tick_nohz_full_mask is allocated so that tick_nohz_hk_apply()
+ * can update it (and the /sys/devices/system/cpu/nohz_full sysfs
+ * attribute) when CPUs are isolated at runtime via DHM. If "nohz_full="
+ * was passed at boot the mask is already allocated; allocate an empty
+ * one here for the runtime-only case.
+ */
+ if (!cpumask_available(tick_nohz_full_mask) &&
+ !zalloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL))
+ pr_warn("tick/nohz: failed to allocate nohz_full_mask for DHM\n");
+
+ ret = housekeeping_register_cbs(HK_TYPE_KERNEL_NOISE, &tick_nohz_hk_cbs);
+ if (ret)
+ pr_warn("tick/nohz: Failed to register hk callback: %d\n", ret);
+ return 0;
+}
+late_initcall(tick_nohz_hk_init_late);
#endif /* #ifdef CONFIG_NO_HZ_FULL */
/*
--
2.43.0
* [PATCH v3 07/13] rcu/nocb: Add explicit housekeeping callback for runtime NOCB toggling
2026-06-18 3:11 [PATCH v3 00/13] Dynamic Housekeeping Management (DHM) via CPUSets Jing Wu
` (5 preceding siblings ...)
2026-06-18 3:11 ` [PATCH v3 06/13] tick/nohz, context_tracking: Prepare for runtime nohz_full updates Jing Wu
@ 2026-06-18 3:11 ` Jing Wu
2026-06-18 3:11 ` [PATCH v3 08/13] genirq: Add explicit housekeeping callback for managed IRQ migration Jing Wu
` (5 subsequent siblings)
12 siblings, 0 replies; 17+ messages in thread
From: Jing Wu @ 2026-06-18 3:11 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
Shuah Khan, Thomas Gleixner
Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
Qiliang Yuan
Register a housekeeping callback for HK_TYPE_KERNEL_NOISE. When the
mask changes, schedule asynchronous work to iterate all possible CPUs
and toggle NOCB mode for CPUs whose state disagrees with the new mask.
CPUs in the housekeeping set are de-offloaded; isolated CPUs are
offloaded.
Use CPU hotplug (remove_cpu() / add_cpu()) because
rcu_nocb_cpu_offload() and rcu_nocb_cpu_deoffload() require the target
CPU to be offline. The hotplug cycle takes the CPU fully offline to
quiesce its RCU state before toggling the NOCB flag, then brings it
back. Skip CPUs whose state already matches to avoid unnecessary
hotplug churn. Only bring a CPU back online if it was online before
the state change (was_online guard avoids add_cpu() on a CPU that was
already offline).
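
The per-CPU decision described above can be sketched as a pure function.
The names nocb_action, nocb_plan and nocb_plan_for_cpu() are hypothetical;
the sketch models one iteration of the loop in rcu_hk_workfn() and does not
model the case where remove_cpu() fails (the real loop then skips the CPU).

```c
#include <assert.h>
#include <stdbool.h>

enum nocb_action {
	NOCB_SKIP,      /* state already matches the mask: no hotplug cycle */
	NOCB_OFFLOAD,   /* isolated CPU that is not yet offloaded */
	NOCB_DEOFFLOAD, /* housekeeping CPU that is still offloaded */
};

struct nocb_plan {
	enum nocb_action action;
	bool cycle_hotplug; /* remove_cpu() before and add_cpu() after the toggle */
};

static struct nocb_plan nocb_plan_for_cpu(bool in_hk_mask, bool is_offloaded,
					  bool was_online)
{
	struct nocb_plan plan = { NOCB_SKIP, false };
	bool should_offload = !in_hk_mask;

	if (should_offload != is_offloaded) {
		plan.action = should_offload ? NOCB_OFFLOAD : NOCB_DEOFFLOAD;
		/* Only a CPU that was online is taken down and brought back. */
		plan.cycle_hotplug = was_online;
	}

	/* A skipped CPU is never hotplug-cycled. */
	assert(!(plan.action == NOCB_SKIP && plan.cycle_hotplug));
	return plan;
}
```

A CPU that was already offline is toggled in place and left offline, which
is the purpose of the was_online guard.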
This differs from Frederic Weisbecker's suggestion to "assume the CPU
is offline" within the RCU subsystem and toggle NOCB without a full
hotplug cycle. The full hotplug approach was chosen for v3 because
rcu_nocb_cpu_offload() and rcu_nocb_cpu_deoffload() are the existing
stable interfaces and the "assume offline" path would require adding
new internal RCU APIs. This is a known limitation that may be
addressed by RCU maintainers in follow-up work.
Snapshot the current HK_TYPE_KERNEL_NOISE cpumask inside the work
function under an RCU read lock rather than caching the pointer at
apply() time. Caching at apply() time would create a use-after-free
hazard: a subsequent housekeeping_update_types() call frees the old
cpumask after synchronize_rcu() but before the work function runs.
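
The hazard can be illustrated with a single-threaded userspace model. It
deliberately does not model RCU or the workqueue: hk_published stands in
for the RCU-published cpumask pointer, hk_update() for an updater that
frees the old mask, and hk_work_snapshot() for the body of the work
function. All three names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t mask_t;

/* Hypothetical stand-in for the published housekeeping cpumask pointer. */
static mask_t *hk_published;

/* Publish a new mask and free the old one, as an updater eventually does. */
static int hk_update(mask_t new_mask)
{
	mask_t *fresh = malloc(sizeof(*fresh));
	mask_t *old = hk_published;

	if (!fresh)
		return -1;
	*fresh = new_mask;
	hk_published = fresh;
	free(old); /* any pointer cached before this call now dangles */
	return 0;
}

/* Work body: copy the mask by value at run time, never keep the pointer. */
static mask_t hk_work_snapshot(void)
{
	assert(hk_published);
	return *hk_published;
}
```

If the work item had stored the pointer returned at queue time, a second
hk_update() before the work ran would leave it pointing at freed memory;
copying at run time always reads the currently published mask.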
Remove the cpus_read_lock() / cpus_read_unlock() pair that wrapped the
hotplug loop. remove_cpu() and add_cpu() acquire the cpu_hotplug_lock
write side; holding the read side via cpus_read_lock() before calling
them causes a deadlock.
Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
kernel/rcu/tree.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 104 insertions(+)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 55df6d37145e8..214ce940f501b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4929,3 +4929,107 @@ void __init rcu_init(void)
#include "tree_exp.h"
#include "tree_nocb.h"
#include "tree_plugin.h"
+
+#ifdef CONFIG_RCU_NOCB_CPU
+/*
+ * RCU NOCB runtime toggle via housekeeping callback.
+ * Schedule the CPU-hotplug work asynchronously because
+ * remove_cpu() and add_cpu() must not be called while holding
+ * cpuset_top_mutex (the hk callback context).
+ *
+ * Snapshot the current HK_TYPE_KERNEL_NOISE cpumask inside the work
+ * function under an RCU read lock to avoid caching a pointer at
+ * apply() time that could be freed before the work runs.
+ */
+struct rcu_hk_work {
+ struct work_struct work;
+};
+
+static void rcu_hk_workfn(struct work_struct *w)
+{
+ struct rcu_hk_work *hw = container_of(w, struct rcu_hk_work, work);
+ cpumask_var_t hk_mask;
+ int cpu, ret;
+
+ if (!alloc_cpumask_var(&hk_mask, GFP_KERNEL)) {
+ kfree(hw);
+ return;
+ }
+
+ rcu_read_lock();
+ cpumask_copy(hk_mask, housekeeping_cpumask_rcu(HK_TYPE_KERNEL_NOISE));
+ rcu_read_unlock();
+
+ for_each_possible_cpu(cpu) {
+ bool should_offload = !cpumask_test_cpu(cpu, hk_mask);
+ bool is_offloaded;
+ bool was_online;
+
+ if (!cpumask_available(rcu_nocb_mask)) {
+ is_offloaded = false;
+ } else {
+ is_offloaded = cpumask_test_cpu(cpu, rcu_nocb_mask);
+ }
+
+ if (should_offload == is_offloaded)
+ continue;
+
+ was_online = cpu_online(cpu);
+ if (was_online) {
+ ret = remove_cpu(cpu);
+ if (ret)
+ continue;
+ }
+ if (should_offload)
+ rcu_nocb_cpu_offload(cpu);
+ else
+ rcu_nocb_cpu_deoffload(cpu);
+ if (was_online)
+ add_cpu(cpu);
+ }
+
+ free_cpumask_var(hk_mask);
+ kfree(hw);
+}
+
+static void rcu_hk_apply(enum hk_type type)
+{
+ struct rcu_hk_work *hw;
+
+ if (!cpumask_available(rcu_nocb_mask))
+ return;
+
+ hw = kmalloc(sizeof(*hw), GFP_KERNEL);
+ if (!hw)
+ return;
+
+ INIT_WORK(&hw->work, rcu_hk_workfn);
+ schedule_work(&hw->work);
+}
+
+static int rcu_hk_validate(enum hk_type type,
+ const struct cpumask *cur_mask,
+ const struct cpumask *new_mask)
+{
+ if (!IS_ENABLED(CONFIG_RCU_NOCB_CPU))
+ return -EOPNOTSUPP;
+ return 0;
+}
+
+static struct housekeeping_cbs rcu_hk_cbs = {
+ .name = "rcu/nocb",
+ .pre_validate = rcu_hk_validate,
+ .apply = rcu_hk_apply,
+};
+
+static int __init rcu_hk_init(void)
+{
+ int ret;
+
+ ret = housekeeping_register_cbs(HK_TYPE_KERNEL_NOISE, &rcu_hk_cbs);
+ if (ret)
+ pr_info("rcu/nocb: runtime NOCB toggle disabled (%d)\n", ret);
+ return 0;
+}
+late_initcall(rcu_hk_init);
+#endif /* CONFIG_RCU_NOCB_CPU */
--
2.43.0
* [PATCH v3 08/13] genirq: Add explicit housekeeping callback for managed IRQ migration
2026-06-18 3:11 [PATCH v3 00/13] Dynamic Housekeeping Management (DHM) via CPUSets Jing Wu
` (6 preceding siblings ...)
2026-06-18 3:11 ` [PATCH v3 07/13] rcu/nocb: Add explicit housekeeping callback for runtime NOCB toggling Jing Wu
@ 2026-06-18 3:11 ` Jing Wu
2026-06-18 3:11 ` [PATCH v3 09/13] watchdog/lockup_detector: Register housekeeping callback for kernel-noise Jing Wu
` (4 subsequent siblings)
12 siblings, 0 replies; 17+ messages in thread
From: Jing Wu @ 2026-06-18 3:11 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
Shuah Khan, Thomas Gleixner
Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
Qiliang Yuan
Register a housekeeping callback for HK_TYPE_MANAGED_IRQ. When the
mask changes, iterate all active managed interrupts, intersect their
current affinity mask with the new housekeeping mask, and re-apply
with irq_do_set_affinity(). Managed interrupts on CPUs removed from
the housekeeping set are migrated to remaining housekeeping CPUs.
Only managed interrupts (IRQD_AFFINITY_MANAGED, tested with
irqd_affinity_is_managed()) are selected because the kernel owns their
affinity; user-controlled IRQ affinities must not be overridden by the
housekeeping layer.
The new HK_TYPE_MANAGED_IRQ cpumask is snapshotted once under an RCU
read lock before the IRQ loop, satisfying the lockdep annotation in
housekeeping_cpumask() for runtime-mutable types.
When the intersection of the IRQ's current affinity and the new
housekeeping mask is non-empty, irq_do_set_affinity() moves the IRQ
to the restricted set. If the intersection is empty (all CPUs that
were serving this IRQ are now isolated), the affinity update is skipped
and the IRQ continues to run on the isolated CPU temporarily. Full
support for the IRQ shutdown / re-startup path (when all serving CPUs
become isolated) is left for follow-up work.
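
The restriction rule can be sketched with plain bitmasks. The sketch
follows the check in the diff, which requires the restricted mask to
contain at least one online CPU (slightly stricter than "non-empty").
mask_t and irq_restrict_affinity() are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: one bit per CPU, at most 64 CPUs. */
typedef uint64_t mask_t;

/*
 * Returns true and stores the restricted affinity in *out when at least
 * one online housekeeping CPU still serves the IRQ. Returns false and
 * leaves *out untouched otherwise, meaning the IRQ stays where it is.
 */
static bool irq_restrict_affinity(mask_t affinity, mask_t hk, mask_t online,
				  mask_t *out)
{
	mask_t tmp = affinity & hk;

	assert(out);
	if ((tmp & online) == 0)
		return false;
	*out = tmp;
	return true;
}
```

With an affinity of CPUs 0-3 and housekeeping CPUs 0-1 the IRQ is narrowed
to CPUs 0-1; with an affinity of CPUs 2-3 the intersection is empty and the
IRQ is left on its isolated CPU.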
The loop runs under irq_lock_sparse(), and each affinity update is made
with the per-descriptor raw spinlock (desc->lock) held, to prevent races
with concurrent affinity changes.
Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
kernel/irq/manage.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 86 insertions(+)
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 2e80724378267..ea97f455eab2a 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -2801,3 +2801,89 @@ bool irq_check_status_bit(unsigned int irq, unsigned int bitmask)
return res;
}
EXPORT_SYMBOL_GPL(irq_check_status_bit);
+
+/*
+ * Managed IRQ housekeeping callback: iterate all managed IRQs and ask
+ * the chip to move them off CPUs newly removed from HK_TYPE_MANAGED_IRQ.
+ */
+static void irq_hk_apply(enum hk_type type)
+{
+ cpumask_var_t hk_mask;
+ struct irq_desc *desc;
+ unsigned int irq;
+
+ if (!alloc_cpumask_var(&hk_mask, GFP_KERNEL))
+ return;
+
+ /*
+ * Snapshot the new HK_TYPE_MANAGED_IRQ mask under an RCU read lock
+ * before iterating IRQ descriptors. The lockdep annotation in
+ * housekeeping_cpumask() requires an RCU read-side critical section
+ * for runtime-mutable types.
+ */
+ rcu_read_lock();
+ cpumask_copy(hk_mask, housekeeping_cpumask_rcu(HK_TYPE_MANAGED_IRQ));
+ rcu_read_unlock();
+
+ irq_lock_sparse();
+
+ for_each_active_irq(irq) {
+ desc = irq_to_desc(irq);
+ if (!desc || !desc->action)
+ continue;
+
+ /*
+ * Only managed interrupts are selected: they have
+ * IRQD_AFFINITY_MANAGED set, meaning the kernel owns their
+ * affinity. User-controlled IRQs are intentionally skipped.
+ *
+ * When the intersection of the current affinity mask and the
+ * new housekeeping mask is non-empty, re-apply the restricted
+ * affinity to migrate the IRQ away from newly isolated CPUs.
+ * If the intersection is empty (all serving CPUs are now
+ * isolated), the IRQ is left on its current CPU temporarily;
+ * handling that case (IRQ shutdown / re-startup) is left for
+ * a follow-up.
+ */
+ if (irqd_affinity_is_managed(&desc->irq_data)) {
+ const struct cpumask *mask;
+ struct cpumask *tmp = this_cpu_ptr(&__tmp_mask);
+
+ raw_spin_lock_irq(&desc->lock);
+ mask = irq_data_get_affinity_mask(&desc->irq_data);
+ cpumask_and(tmp, mask, hk_mask);
+ if (cpumask_intersects(tmp, cpu_online_mask))
+ irq_do_set_affinity(&desc->irq_data, tmp, false);
+ raw_spin_unlock_irq(&desc->lock);
+ }
+ }
+
+ irq_unlock_sparse();
+ free_cpumask_var(hk_mask);
+}
+
+static int irq_hk_validate(enum hk_type type,
+ const struct cpumask *cur_mask,
+ const struct cpumask *new_mask)
+{
+ if (!IS_ENABLED(CONFIG_SMP))
+ return -EOPNOTSUPP;
+ return 0;
+}
+
+static struct housekeeping_cbs irq_hk_cbs = {
+ .name = "genirq/managed",
+ .pre_validate = irq_hk_validate,
+ .apply = irq_hk_apply,
+};
+
+static int __init irq_hk_init(void)
+{
+ int ret;
+
+ ret = housekeeping_register_cbs(HK_TYPE_MANAGED_IRQ, &irq_hk_cbs);
+ if (ret)
+ pr_info("genirq: managed IRQ runtime migration disabled (%d)\n", ret);
+ return 0;
+}
+late_initcall(irq_hk_init);
--
2.43.0
* [PATCH v3 09/13] watchdog/lockup_detector: Register housekeeping callback for kernel-noise
2026-06-18 3:11 [PATCH v3 00/13] Dynamic Housekeeping Management (DHM) via CPUSets Jing Wu
` (7 preceding siblings ...)
2026-06-18 3:11 ` [PATCH v3 08/13] genirq: Add explicit housekeeping callback for managed IRQ migration Jing Wu
@ 2026-06-18 3:11 ` Jing Wu
2026-06-18 3:11 ` [PATCH v3 10/13] sched: Guard sched_tick_start/stop against uninitialized tick_work_cpu Jing Wu
` (3 subsequent siblings)
12 siblings, 0 replies; 17+ messages in thread
From: Jing Wu @ 2026-06-18 3:11 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
Shuah Khan, Thomas Gleixner
Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
Qiliang Yuan
Initialize watchdog_cpumask from HK_TYPE_KERNEL_NOISE rather than
HK_TYPE_TIMER at boot, so the initial mask already reflects any CPUs
excluded by nohz_full= on the kernel command line.
Register a housekeeping_cbs so watchdog_cpumask stays in sync with
HK_TYPE_KERNEL_NOISE when isolation boundaries change at runtime via
cpuset isolated partitions. The apply() callback copies the new
housekeeping mask into watchdog_cpumask and triggers
__lockup_detector_reconfigure() to restart watchdog threads on the
updated CPU set.
When nohz_full= is absent at boot, tick_nohz_full_running remains
false and DHM isolated partitions do not activate tick suppression.
In that case watchdog_hk_apply() is a no-op: there is no need to
reconfigure the watchdog CPU set because the full nohz_full
infrastructure was never initialized.
Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
kernel/watchdog.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 55 insertions(+), 1 deletion(-)
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 87dd5e0f6968d..998ad94da4cb9 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -1389,7 +1389,7 @@ void __init lockup_detector_init(void)
pr_info("Disabling watchdog on nohz_full cores by default\n");
cpumask_copy(&watchdog_cpumask,
- housekeeping_cpumask(HK_TYPE_TIMER));
+ housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
if (!watchdog_hardlockup_probe())
watchdog_hardlockup_available = true;
@@ -1398,3 +1398,57 @@ void __init lockup_detector_init(void)
lockup_detector_setup();
}
+
+/*
+ * Watchdog housekeeping callback: resync watchdog_cpumask with
+ * HK_TYPE_KERNEL_NOISE when isolation boundaries change at runtime.
+ */
+#ifdef CONFIG_CPU_ISOLATION
+static void watchdog_hk_apply(enum hk_type type)
+{
+ const struct cpumask *hk;
+
+ /*
+ * When nohz_full= was not given at boot, tick_nohz_full_running
+ * remains false and the full nohz_full infrastructure was never
+ * initialised. DHM isolated partitions do not activate tick
+ * suppression in that case, so there is no need to reconfigure the
+ * watchdog CPU set.
+ */
+#ifdef CONFIG_NO_HZ_FULL
+ if (!READ_ONCE(tick_nohz_full_running))
+ return;
+#endif
+
+ hk = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
+ if (mutex_trylock(&watchdog_mutex)) {
+ cpumask_copy(&watchdog_cpumask, hk);
+ __lockup_detector_reconfigure(false);
+ mutex_unlock(&watchdog_mutex);
+ }
+}
+
+static int watchdog_hk_validate(enum hk_type type,
+ const struct cpumask *cur_mask,
+ const struct cpumask *new_mask)
+{
+ return 0;
+}
+
+static struct housekeeping_cbs watchdog_hk_cbs = {
+ .name = "watchdog",
+ .pre_validate = watchdog_hk_validate,
+ .apply = watchdog_hk_apply,
+};
+
+static int __init watchdog_hk_init(void)
+{
+ int ret;
+
+ ret = housekeeping_register_cbs(HK_TYPE_KERNEL_NOISE, &watchdog_hk_cbs);
+ if (ret)
+ pr_debug("watchdog: hk callback registration skipped (%d)\n", ret);
+ return 0;
+}
+late_initcall(watchdog_hk_init);
+#endif
--
2.43.0
* [PATCH v3 10/13] sched: Guard sched_tick_start/stop against uninitialized tick_work_cpu
2026-06-18 3:11 [PATCH v3 00/13] Dynamic Housekeeping Management (DHM) via CPUSets Jing Wu
` (8 preceding siblings ...)
2026-06-18 3:11 ` [PATCH v3 09/13] watchdog/lockup_detector: Register housekeeping callback for kernel-noise Jing Wu
@ 2026-06-18 3:11 ` Jing Wu
2026-06-18 3:11 ` [PATCH v3 11/13] cgroup/cpuset: Extend isolated partition to trigger kernel-noise isolation Jing Wu
` (2 subsequent siblings)
12 siblings, 0 replies; 17+ messages in thread
From: Jing Wu @ 2026-06-18 3:11 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
Shuah Khan, Thomas Gleixner
Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
Qiliang Yuan
sched_tick_start() and sched_tick_stop() are called during CPU hotplug
for CPUs not in the HK_TYPE_KERNEL_NOISE set. They dereference
tick_work_cpu, which is allocated by sched_tick_offload_init(). That
function is called only from housekeeping_init(), and only when
nohz_full= is present at boot.
When DHM enables HK_TYPE_KERNEL_NOISE isolation for the first time at
runtime via housekeeping_update_types(), tick_work_cpu remains NULL
because sched_tick_offload_init() is __init-only and cannot be
re-invoked. A
subsequent CPU offline/online cycle for an isolated CPU triggers
WARN_ON_ONCE(!tick_work_cpu) followed by a NULL-pointer dereference in
per_cpu_ptr(tick_work_cpu, cpu), crashing the kernel.
Since nohz_full= was not active at boot, tick_nohz_full_running remains
false and the tick-offload infrastructure is never activated; isolated
CPUs continue to receive their own ticks. Guard both helpers with an
additional !tick_work_cpu check so they become no-ops in this case.
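The failure mode and the guard can be modelled in userspace. The sketch
below is an illustration only, not kernel code: tick_work_storage stands
in for tick_work_cpu, while MODEL_NR_CPUS, struct tick_work_model and
model_housekeeping_cpu() (which treats CPUs 0-3 as housekeeping) are
hypothetical names made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_CPUS 8

/* Hypothetical stand-in for the per-CPU tick work state. */
struct tick_work_model {
	int state;
};

/* Stays NULL unless a boot-time init step allocated it. */
static struct tick_work_model *tick_work_storage = NULL;

/* Assumption for the example: CPUs 0-3 are housekeeping CPUs. */
static bool model_housekeeping_cpu(int cpu)
{
	return cpu < 4;
}

/*
 * Returns true if the offload state was touched, false if the call was
 * a no-op. Without the second half of the condition, an isolated CPU
 * would index through a NULL pointer.
 */
static bool model_sched_tick_start(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);

	if (model_housekeeping_cpu(cpu) || !tick_work_storage)
		return false;

	tick_work_storage[cpu].state = 1;
	return true;
}
```

With tick_work_storage left NULL, model_sched_tick_start(5) returns
false instead of faulting; once storage is assigned, the same call
updates the entry for CPU 5.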
Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
kernel/sched/core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 371b509d92164..df004e3efca70 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5778,7 +5778,7 @@ static void sched_tick_start(int cpu)
int os;
struct tick_work *twork;
- if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
+ if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) || !tick_work_cpu)
return;
WARN_ON_ONCE(!tick_work_cpu);
@@ -5799,7 +5799,7 @@ static void sched_tick_stop(int cpu)
struct tick_work *twork;
int os;
- if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
+ if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) || !tick_work_cpu)
return;
WARN_ON_ONCE(!tick_work_cpu);
--
2.43.0
* [PATCH v3 11/13] cgroup/cpuset: Extend isolated partition to trigger kernel-noise isolation
2026-06-18 3:11 [PATCH v3 00/13] Dynamic Housekeeping Management (DHM) via CPUSets Jing Wu
` (9 preceding siblings ...)
2026-06-18 3:11 ` [PATCH v3 10/13] sched: Guard sched_tick_start/stop against uninitialized tick_work_cpu Jing Wu
@ 2026-06-18 3:11 ` Jing Wu
2026-06-18 3:11 ` [PATCH v3 12/13] docs: cgroup-v2: Document kernel-noise isolation via isolated partitions Jing Wu
2026-06-18 3:11 ` [PATCH v3 13/13] selftests/cgroup: Add kernel-noise isolation test to cpuset selftest Jing Wu
12 siblings, 0 replies; 17+ messages in thread
From: Jing Wu @ 2026-06-18 3:11 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
Shuah Khan, Thomas Gleixner
Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
Qiliang Yuan
When a cpuset isolated partition is created or destroyed, also drive
kernel-noise housekeeping types (HK_TYPE_KERNEL_NOISE and
HK_TYPE_MANAGED_IRQ) through housekeeping_update_types(). The sched
domain mask (HK_TYPE_DOMAIN) is updated first via the existing
housekeeping_update() call, then the explicit callback chain in
housekeeping_update_types() invokes subsystem apply() handlers to
toggle nohz_full, managed IRQ migration, and RCU NOCB offloading.
The update runs outside cpuset_mutex and cpus_read_lock, protected
only by cpuset_top_mutex.
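As a rough userspace model of how a type bitmask can drive a per-type
callback table, consider the sketch below. It is not the implementation
from this series: the real housekeeping_update_types() is not shown in
this patch, and the two-phase order (validate every selected type, then
apply) is an assumption based on the pre_validate/apply naming. All
model_* and MODEL_* names are hypothetical, and cpumasks are reduced to
plain unsigned long bitmaps holding the new housekeeping mask.

```c
#include <assert.h>
#include <errno.h>

enum model_hk_type {
	MODEL_HK_KERNEL_NOISE,
	MODEL_HK_MANAGED_IRQ,
	MODEL_HK_MAX,
};

#define MODEL_BIT(n)	(1UL << (n))

/* Hypothetical mirror of a per-type callback table entry. */
struct model_hk_cbs {
	const char *name;
	int (*pre_validate)(enum model_hk_type type,
			    unsigned long cur_mask, unsigned long new_mask);
	void (*apply)(enum model_hk_type type);
};

static const struct model_hk_cbs *model_cbs[MODEL_HK_MAX];
static unsigned long model_hk_mask[MODEL_HK_MAX];
static int model_apply_count;

/* Example validator: refuse a mask with no housekeeping CPU left. */
static int model_reject_empty(enum model_hk_type type,
			      unsigned long cur_mask, unsigned long new_mask)
{
	(void)type;
	(void)cur_mask;
	return new_mask ? 0 : -EINVAL;
}

/* Example apply handler: only counts how often it ran. */
static void model_count_apply(enum model_hk_type type)
{
	(void)type;
	model_apply_count++;
}

static const struct model_hk_cbs model_tick_cbs = {
	.name		= "tick-model",
	.pre_validate	= model_reject_empty,
	.apply		= model_count_apply,
};

static int model_update_types(unsigned long types, unsigned long new_mask)
{
	int t, ret;

	assert((types >> MODEL_HK_MAX) == 0);	/* no unknown type bits */

	/* Phase 1: any registered validator may veto the whole update. */
	for (t = 0; t < MODEL_HK_MAX; t++) {
		if (!(types & MODEL_BIT(t)) || !model_cbs[t] ||
		    !model_cbs[t]->pre_validate)
			continue;
		ret = model_cbs[t]->pre_validate((enum model_hk_type)t,
						 model_hk_mask[t], new_mask);
		if (ret)
			return ret;
	}

	/* Phase 2: publish the new mask, then let subsystems react. */
	for (t = 0; t < MODEL_HK_MAX; t++) {
		if (!(types & MODEL_BIT(t)))
			continue;
		model_hk_mask[t] = new_mask;
		if (model_cbs[t] && model_cbs[t]->apply)
			model_cbs[t]->apply((enum model_hk_type)t);
	}
	return 0;
}
```

In this model a rejected update leaves every mask untouched, which is
the property a separate validation pass is meant to provide.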
Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
kernel/cgroup/cpuset.c | 23 ++++++++++++++++++-----
1 file changed, 18 insertions(+), 5 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 5c33ab20cc208..67b93bd4d58f2 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1347,17 +1347,30 @@ static void cpuset_update_sd_hk_unlock(void)
rebuild_sched_domains_locked();
if (update_housekeeping) {
+ static const unsigned long noise_types =
+ BIT(HK_TYPE_KERNEL_NOISE) | BIT(HK_TYPE_MANAGED_IRQ);
+
update_housekeeping = false;
cpumask_copy(isolated_hk_cpus, isolated_cpus);
- /*
- * housekeeping_update() is now called without holding
- * cpus_read_lock and cpuset_mutex. Only cpuset_top_mutex
- * is still being held for mutual exclusion.
- */
mutex_unlock(&cpuset_mutex);
cpus_read_unlock();
+
+ /*
+ * Update the sched domain mask first; it must succeed
+ * before the kernel-noise types because workqueue flush
+ * and timer migration depend on the sched domain mask.
+ */
WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus));
+
+ /*
+ * Drive kernel-noise types through the new explicit
+ * callback chain. Tick/rcu/genirq subtypes react
+ * through their registered housekeeping_cbs apply()
+ * handlers.
+ */
+ WARN_ON_ONCE(housekeeping_update_types(noise_types,
+ isolated_hk_cpus));
mutex_unlock(&cpuset_top_mutex);
} else {
cpuset_full_unlock();
--
2.43.0
* [PATCH v3 12/13] docs: cgroup-v2: Document kernel-noise isolation via isolated partitions
2026-06-18 3:11 [PATCH v3 00/13] Dynamic Housekeeping Management (DHM) via CPUSets Jing Wu
` (10 preceding siblings ...)
2026-06-18 3:11 ` [PATCH v3 11/13] cgroup/cpuset: Extend isolated partition to trigger kernel-noise isolation Jing Wu
@ 2026-06-18 3:11 ` Jing Wu
2026-06-18 3:11 ` [PATCH v3 13/13] selftests/cgroup: Add kernel-noise isolation test to cpuset selftest Jing Wu
12 siblings, 0 replies; 17+ messages in thread
From: Jing Wu @ 2026-06-18 3:11 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
Shuah Khan, Thomas Gleixner
Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
Qiliang Yuan
Document that cpuset.cpus.partition=isolated now drives runtime updates
of the housekeeping masks for kernel-noise types: nohz_full (tick
suppression), RCU NOCB offloading, and managed IRQ migration. No
additional cgroupfs files are required; the partition update path
automatically triggers explicit housekeeping callbacks for all affected
subsystems.
Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
Documentation/admin-guide/cgroup-v2.rst | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6efd0095ed995..7c3b048e75cb5 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2721,6 +2721,14 @@ Cpuset Interface Files
kernel boot command line option. If those CPUs are to be put
into a partition, they have to be used in an isolated partition.
+ When an isolated partition is created or destroyed, the kernel
+ automatically drives runtime updates of the housekeeping masks
+ for kernel-noise types (nohz_full, RCU NOCB, managed
+ interrupts). This extends isolation beyond scheduler domains:
+ the tick is stopped on isolated CPUs, RCU callbacks are
+ offloaded to housekeeping cores, and managed interrupts are
+ migrated away. No additional cgroupfs files are required.
+
Device controller
-----------------
--
2.43.0
* [PATCH v3 13/13] selftests/cgroup: Add kernel-noise isolation test to cpuset selftest
2026-06-18 3:11 [PATCH v3 00/13] Dynamic Housekeeping Management (DHM) via CPUSets Jing Wu
` (11 preceding siblings ...)
2026-06-18 3:11 ` [PATCH v3 12/13] docs: cgroup-v2: Document kernel-noise isolation via isolated partitions Jing Wu
@ 2026-06-18 3:11 ` Jing Wu
12 siblings, 0 replies; 17+ messages in thread
From: Jing Wu @ 2026-06-18 3:11 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
Shuah Khan, Thomas Gleixner
Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
Qiliang Yuan
Add test_hk_noise_isolated() to test_cpuset_prs.sh to verify that
creating and destroying an isolated cpuset partition updates both the
domain isolation state and the kernel-noise (nohz_full) state.
For domain isolation, the test checks cpuset.cpus.isolated before and
after the partition create/destroy cycle.
For kernel-noise isolation, the test reads
/sys/devices/system/cpu/nohz_full to confirm that the CPUs placed in
an isolated partition appear in the nohz_full mask while the partition
is active, and are removed from it once the partition is destroyed.
This sysfs attribute only exists when CONFIG_NO_HZ_FULL is enabled;
the nohz_full checks are skipped when it is absent so the test remains
usable on kernels without NO_HZ_FULL.
Add cpu_in_cpulist() to correctly determine whether a CPU number falls
within a kernel cpulist string (e.g. "4-7"). A plain grep cannot
detect membership in the interior of a range; cpu_in_cpulist() walks
each comma-separated element and handles both single values and
lo-hi ranges explicitly.
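The same membership walk can be sketched in C for readers who want to
check the logic outside the shell. This is an illustrative userspace
helper, not part of the patch; it handles only single values and lo-hi
ranges, and it treats unparsable input (for example an empty string) as
"not a member", which is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Return true if cpu is covered by a cpulist string such as "0-3,8-31".
 * A substring match would wrongly report CPU 5 as present in "15" and
 * would miss CPU 5 in "4-7", so each element is parsed as a number or
 * as a lo-hi range.
 */
static bool cpu_in_cpulist(long cpu, const char *list)
{
	const char *p = list;

	assert(list != NULL);

	while (*p != '\0') {
		char *end;
		long lo, hi;

		lo = strtol(p, &end, 10);
		if (end == p)
			return false;	/* not a number: no match */

		hi = lo;
		if (*end == '-') {
			const char *q = end + 1;

			hi = strtol(q, &end, 10);
			if (end == q)
				return false;	/* dangling "lo-" */
		}

		if (cpu >= lo && cpu <= hi)
			return true;

		if (*end != ',')
			break;		/* end of list or trailing newline */
		p = end + 1;
	}
	return false;
}
```

For example, cpu_in_cpulist(5, "4-7") is true while
cpu_in_cpulist(5, "15") is false, which is exactly the distinction a
plain grep for "5" cannot make.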
The test also covers: rejection of all-CPU isolation, the SMT sibling
constraint, nested partition inheritance, and a 100-cycle pressure test.
nohz_full is verified to be restored to its pre-test value after each
create/destroy cycle and after the pressure test.
Fix awk invocation to drop the spurious -e flag.
Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
tools/testing/selftests/cgroup/test_cpuset_prs.sh | 204 +++++++++++++++++++++-
1 file changed, 203 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index a56f4153c64df..047db14953fac 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -20,7 +20,7 @@ skip_test() {
WAIT_INOTIFY=$(cd $(dirname $0); pwd)/wait_inotify
# Find cgroup v2 mount point
-CGROUP2=$(mount -t cgroup2 | head -1 | awk -e '{print $3}')
+CGROUP2=$(mount -t cgroup2 | head -1 | awk '{print $3}')
[[ -n "$CGROUP2" ]] || skip_test "Cgroup v2 mount point not found!"
SUBPARTS_CPUS=$CGROUP2/.__DEBUG__.cpuset.cpus.subpartitions
CPULIST=$(cat $CGROUP2/cpuset.cpus.effective)
@@ -1204,9 +1204,211 @@ test_inotify()
echo "" > cpuset.cpus
}
+#
+# cpu_in_cpulist <cpu> <cpulist>
+#
+# Return 0 if <cpu> appears in <cpulist> (a kernel cpumask list such as
+# "0-3,8-31"), non-zero otherwise. The kernel cpulist format uses ranges
+# ("lo-hi") and comma-separated items; a simple grep cannot detect that a
+# number falls in the middle of a range, so walk each element explicitly.
+#
+cpu_in_cpulist()
+{
+ local cpu=$1 list=$2 range lo hi
+ for range in $(echo "$list" | tr ',' ' '); do
+ if [[ "$range" == *-* ]]; then
+ lo=${range%-*}
+ hi=${range#*-}
+ [[ $cpu -ge $lo && $cpu -le $hi ]] && return 0
+ else
+ [[ $cpu -eq $range ]] && return 0
+ fi
+ done
+ return 1
+}
+
+#
+# Test that isolated partition creation/destruction drives kernel-noise
+# housekeeping mask updates and remains correct under pressure.
+#
+# Requires: >=8 CPUs, no isolcpus= boot conflict, root
+#
+test_hk_noise_isolated()
+{
+ local ISOL_BEFORE TEST_CPUS i PART ISOL_AFTER ISOL_RESTORE
+ local NOHZ_FILE NOHZ_BEFORE NOHZ_AFTER NOHZ_RESTORE
+ local HK_NOHZ_CHECK=0
+ local LOOPS=100
+
+ [[ $NR_CPUS -ge 8 ]] || {
+ echo "HK-noise test skipped: need >=8 CPUs, have $NR_CPUS"
+ return 0
+ }
+
+ # Detect whether CONFIG_NO_HZ_FULL is active: the sysfs attribute
+ # /sys/devices/system/cpu/nohz_full exposes the current nohz_full
+ # cpumask and is only present when NO_HZ_FULL is enabled.
+ NOHZ_FILE=/sys/devices/system/cpu/nohz_full
+ [[ -r "$NOHZ_FILE" ]] && HK_NOHZ_CHECK=1
+
+ cd $CGROUP2/test
+ echo member > cpuset.cpus.partition 2>/dev/null
+ echo "" > cpuset.cpus 2>/dev/null
+
+ ISOL_BEFORE=$(cat $CGROUP2/cpuset.cpus.isolated)
+ [[ $HK_NOHZ_CHECK -eq 1 ]] && NOHZ_BEFORE=$(cat $NOHZ_FILE)
+ TEST_CPUS="4-7"
+ echo $TEST_CPUS > cpuset.cpus
+
+ #
+ # Basic create/destroy cycle — verify domain isolation and
+ # kernel-noise (nohz_full) changes together.
+ #
+ console_msg "HK-noise: basic create/destroy cycle"
+ echo isolated > cpuset.cpus.partition
+
+ ISOL_AFTER=$(cat $CGROUP2/cpuset.cpus.isolated)
+ [[ $ISOL_AFTER != "$ISOL_BEFORE" ]] || {
+ echo "FAIL: isolated set unchanged after partition create"
+ exit 1
+ }
+
+ if [[ $HK_NOHZ_CHECK -eq 1 ]]; then
+ NOHZ_AFTER=$(cat $NOHZ_FILE)
+ # Verify that the newly isolated CPUs (4-7) appear in nohz_full.
+ # nohz_full = inverse of housekeeping, so isolating 4-7 should
+ # add them to nohz_full.
+ for cpu in 4 5 6 7; do
+ if ! cpu_in_cpulist $cpu "$NOHZ_AFTER"; then
+ echo "FAIL: cpu${cpu} not in nohz_full after isolation" \
+ "(got: '$NOHZ_AFTER')"
+ exit 1
+ fi
+ done
+ console_msg "HK-noise: nohz_full after isolation: $NOHZ_AFTER"
+ fi
+
+ echo member > cpuset.cpus.partition
+
+ ISOL_RESTORE=$(cat $CGROUP2/cpuset.cpus.isolated)
+ [[ $ISOL_RESTORE = "$ISOL_BEFORE" ]] || {
+ echo "FAIL: expected '$ISOL_BEFORE' after destroy, got '$ISOL_RESTORE'"
+ exit 1
+ }
+
+ if [[ $HK_NOHZ_CHECK -eq 1 ]]; then
+ NOHZ_RESTORE=$(cat $NOHZ_FILE)
+ [[ "$NOHZ_RESTORE" = "$NOHZ_BEFORE" ]] || {
+ echo "FAIL: nohz_full not restored: expected '$NOHZ_BEFORE'," \
+ "got '$NOHZ_RESTORE'"
+ exit 1
+ }
+ fi
+
+ #
+ # Reject all-CPU isolation (must leave at least one housekeeping CPU)
+ #
+ console_msg "HK-noise: reject all-CPU isolation"
+ echo 0-$((NR_CPUS - 1)) > cpuset.cpus
+ echo isolated > cpuset.cpus.partition
+ PART=$(cat cpuset.cpus.partition)
+ [[ $PART = *invalid* || $PART = member ]] || {
+ echo "FAIL: all-CPU isolation was not rejected, got '$PART'"
+ exit 1
+ }
+
+ #
+ # SMT safety: partial sibling isolation
+ #
+ console_msg "HK-noise: SMT sibling constraint"
+ echo $TEST_CPUS > cpuset.cpus
+ echo isolated > cpuset.cpus.partition
+ PART=$(cat cpuset.cpus.partition)
+ [[ $PART = isolated ]] || {
+ echo "FAIL: could not create isolated partition, got '$PART'"
+ exit 1
+ }
+ echo member > cpuset.cpus.partition
+
+ #
+ # Nested partition: parent root → child isolated
+ #
+ console_msg "HK-noise: nested partition inheritance"
+ echo $TEST_CPUS > cpuset.cpus
+ test_partition root
+ mkdir -p HK_SUB
+ cd HK_SUB
+ echo 4-5 > cpuset.cpus
+ echo isolated > cpuset.cpus.partition
+ ISOL_AFTER=$(cat $CGROUP2/cpuset.cpus.isolated)
+ [[ -n $ISOL_AFTER ]] || {
+ echo "FAIL: nested isolated partition not reflected in cpuset.cpus.isolated"
+ exit 1
+ }
+ echo member > cpuset.cpus.partition
+ cd $CGROUP2/test
+ echo member > cpuset.cpus.partition
+ rmdir HK_SUB 2>/dev/null
+
+ #
+ # Pressure test: 100 create/destroy cycles
+ #
+ console_msg "HK-noise: pressure test ($LOOPS cycles)"
+ echo $TEST_CPUS > cpuset.cpus
+ for i in $(seq 1 $LOOPS); do
+ echo isolated > cpuset.cpus.partition
+ PART=$(cat cpuset.cpus.partition)
+ [[ $PART = isolated ]] || {
+ echo "FAIL: cycle $i create failed, got '$PART'"
+ exit 1
+ }
+ echo member > cpuset.cpus.partition
+ PART=$(cat cpuset.cpus.partition)
+ [[ $PART = member ]] || {
+ echo "FAIL: cycle $i destroy failed, got '$PART'"
+ exit 1
+ }
+ done
+
+ #
+ # Stability: after pressure test, verify final state
+ #
+ console_msg "HK-noise: post-pressure cleanup"
+ echo isolated > cpuset.cpus.partition
+ ISOL_AFTER=$(cat $CGROUP2/cpuset.cpus.isolated)
+ [[ -n $ISOL_AFTER ]] || {
+ echo "FAIL: isolated set empty after pressure test"
+ exit 1
+ }
+ echo member > cpuset.cpus.partition
+ echo "" > cpuset.cpus
+ ISOL_RESTORE=$(cat $CGROUP2/cpuset.cpus.isolated)
+ [[ $ISOL_RESTORE = "$ISOL_BEFORE" ]] || {
+ echo "FAIL: final isolated '$ISOL_RESTORE' != '$ISOL_BEFORE'"
+ exit 1
+ }
+
+ if [[ $HK_NOHZ_CHECK -eq 1 ]]; then
+ NOHZ_RESTORE=$(cat $NOHZ_FILE)
+ [[ "$NOHZ_RESTORE" = "$NOHZ_BEFORE" ]] || {
+ echo "FAIL: nohz_full not restored after pressure test:" \
+ "expected '$NOHZ_BEFORE', got '$NOHZ_RESTORE'"
+ exit 1
+ }
+ fi
+
+ cd $CGROUP2
+ if [[ $HK_NOHZ_CHECK -eq 1 ]]; then
+ console_msg "HK-noise: PASSED (with nohz_full verification)"
+ else
+ console_msg "HK-noise: PASSED (nohz_full skipped: CONFIG_NO_HZ_FULL not active)"
+ fi
+}
+
trap cleanup 0 2 3 6
run_state_test TEST_MATRIX
run_remote_state_test REMOTE_TEST_MATRIX
test_isolated
test_inotify
+test_hk_noise_isolated
echo "All tests PASSED."
--
2.43.0
* Re: [PATCH v3 05/13] cpu/hotplug: Reserve CPUHP states for nohz_full and managed IRQ down-paths
2026-06-18 3:11 ` [PATCH v3 05/13] cpu/hotplug: Reserve CPUHP states for nohz_full and managed IRQ down-paths Jing Wu
@ 2026-06-18 16:06 ` Thomas Gleixner
0 siblings, 0 replies; 17+ messages in thread
From: Thomas Gleixner @ 2026-06-18 16:06 UTC (permalink / raw)
To: Jing Wu, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
Shuah Khan
Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
Qiliang Yuan
On Thu, Jun 18 2026 at 11:11, Jing Wu wrote:
> Add CPUHP_AP_NO_HZ_FULL_DYING and CPUHP_AP_IRQ_AFFINITY_DYING to the
> cpuhp_state enum. These dying callbacks are invoked during CPU offline
> before the tick is stopped, enabling clean tick handover and managed
> IRQ migration when a CPU transitions between isolated and housekeeping
> states.
>
> The existing CPUHP_AP_IRQ_AFFINITY_ONLINE already handles managed IRQ
> restoration on CPU online. The new dying callback completes the pair,
> migrating managed interrupts away from the CPU before it goes down.
What? They are migrated away today already when the CPU goes down unless
the CPU is the last one in the affinity set of the interrupt. So why do
you need a new step for something which already exists?
> Subsequent patches register handlers for these states.
>
> Signed-off-by: Jing Wu <realwujing@gmail.com>
> Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
This SOB chain is broken (in all patches). See Documentation/process/...
Thanks,
tglx
* Re: [PATCH v3 06/13] tick/nohz, context_tracking: Prepare for runtime nohz_full updates
2026-06-18 3:11 ` [PATCH v3 06/13] tick/nohz, context_tracking: Prepare for runtime nohz_full updates Jing Wu
@ 2026-06-18 17:27 ` Thomas Gleixner
2026-06-18 19:49 ` Thomas Gleixner
0 siblings, 1 reply; 17+ messages in thread
From: Thomas Gleixner @ 2026-06-18 17:27 UTC (permalink / raw)
To: Jing Wu, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
Shuah Khan
Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
Qiliang Yuan
On Thu, Jun 18 2026 at 11:11, Jing Wu wrote:
> Remove __init from ct_cpu_track_user() and __initdata from the
> initialized flag so context tracking can be activated on CPUs that
> join nohz_full at runtime. Drop the __ro_after_init attribute from
> the context_tracking_key static key, allowing static_branch_dec()
> when a CPU leaves nohz_full.
>
> Add ct_cpu_untrack_user() to reverse ct_cpu_track_user(), decrementing
> the static key and clearing the per-CPU tracking state.
Please do not enumerate WHAT the patch is doing. Explain the context and
the WHY
https://docs.kernel.org/process/maintainer-tip.html#changelog
>
> #include <asm/irq_regs.h>
> @@ -653,11 +654,6 @@ void __init tick_nohz_init(void)
> if (!tick_nohz_full_running)
> return;
>
> - /*
> - * Full dynticks uses IRQ work to drive the tick rescheduling on safe
> - * locking contexts. But then we need IRQ work to raise its own
> - * interrupts to avoid circular dependency on the tick.
> - */
This comment is removed because it's no longer correct? How is this
related to $Subject?
> if (!arch_irq_work_has_interrupt()) {
> pr_warn("NO_HZ: Can't run full dynticks because arch doesn't support IRQ work self-IPIs\n");
> cpumask_clear(tick_nohz_full_mask);
> @@ -676,6 +672,16 @@ void __init tick_nohz_init(void)
> }
> }
>
> + /*
> + * Pre-initialize context tracking for all possible CPUs so
> + * ctx tracking is already active when a CPU is later added to
> + * nohz_full at runtime. The tracking overhead is negligible
> + * because the static key is not incremented yet — only per-CPU
> + * tracking state is set up.
> + */
> + if (IS_ENABLED(CONFIG_CONTEXT_TRACKING_USER_FORCE))
> + context_tracking_init();
Seriously? Care to look where and when context_tracking_init() is invoked?
> for_each_cpu(cpu, tick_nohz_full_mask)
> ct_cpu_track_user(cpu);
>
> @@ -686,6 +692,147 @@ void __init tick_nohz_init(void)
> pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
> cpumask_pr_args(tick_nohz_full_mask));
> }
> +
> +static int tick_nohz_hk_validate(enum hk_type type,
> + const struct cpumask *cur_mask,
> + const struct cpumask *new_mask)
> +{
> + if (!IS_ENABLED(CONFIG_NO_HZ_FULL))
> + return -EOPNOTSUPP;
> + return 0;
> +}
Why is this code even compiled when CONFIG_NO_HZ_FULL is not enabled?
> +
> +static void tick_nohz_hk_apply(enum hk_type type)
> +{
> + static DEFINE_SPINLOCK(tick_nohz_lock);
> + cpumask_var_t nohz_full, added, removed;
> + bool was_running;
> + int cpu;
> +
> + if (!alloc_cpumask_var(&nohz_full, GFP_KERNEL))
> + return;
This looks more than wrong. If this fails then the core code will
happily proceed with the completely wrong state.
> + if (!alloc_cpumask_var(&added, GFP_KERNEL)) {
> + free_cpumask_var(nohz_full);
> + return;
> + }
> + if (!alloc_cpumask_var(&removed, GFP_KERNEL)) {
> + free_cpumask_var(added);
> + free_cpumask_var(nohz_full);
> + return;
> + }
cpumask_var_t __free(free_cpumask_var) a = CPUMASK_VAR_NULL;
cpumask_var_t __free(free_cpumask_var) b = CPUMASK_VAR_NULL;
cpumask_var_t __free(free_cpumask_var) c = CPUMASK_VAR_NULL;
if (!alloc_cpumask_var(&a, GFP_KERNEL))
return -ENOMEM;
....
> +
> + /*
> + * Snapshot the new HK_TYPE_KERNEL_NOISE mask under an RCU read lock.
> + * housekeeping_update_types() completes synchronize_rcu() before
> + * invoking apply(), so the new pointer is stable; however the lockdep
> + * annotation in housekeeping_cpumask() still requires an RCU read-side
> + * critical section for runtime-mutable types.
This comment is explaining the obvious: housekeeping_cpumask_rcu()
> + */
> + rcu_read_lock();
scoped_guard(rcu)
> + cpumask_andnot(nohz_full, cpu_possible_mask,
> + housekeeping_cpumask_rcu(HK_TYPE_KERNEL_NOISE));
> + rcu_read_unlock();
> +
> + /*
> + * When "nohz_full=" was not passed at boot, tick_nohz_full_running is
> + * false and the full dynticks infrastructure (sched_tick_offload_init,
> + * RCU nohz quiescent-state reporting, context-tracking bootstrap) was
> + * never initialised. In that case restrict the update to
> + * tick_nohz_full_mask so the /sys/devices/system/cpu/nohz_full sysfs
> + * attribute reflects DHM-isolated CPUs without enabling tick
> + * suppression, context tracking, or timer migration – all of which
> + * require boot-time setup and would deadlock on the first
> + * synchronize_rcu() call after CPUs are offlined.
What? You tell user space that the CPUs are nohz_full by updating the
mask, which is exposed in sysfs, which is blatantly wrong.
> + */
> + was_running = READ_ONCE(tick_nohz_full_running);
Q: This READ_ONCE() pairs with which WRITE_ONCE()?
A: With none, so it's just voodoo programming.
> + spin_lock(&tick_nohz_lock);
This lock protects against the housekeeping core code invoking the apply
callback multiple times in parallel, right?
If that happens then there are bigger problems than corrupted masks.
> + /*
> + * When nohz_full= was active at boot, compute the delta and update
> + * context tracking for CPUs joining or leaving the nohz_full set.
> + * Skip when !was_running: ct_cpu_track_user() calls
> + * static_branch_inc() which may sleep (jump_label_update on the
> + * 0→1 transition) – illegal inside a spinlock.
If you remove the pointless voodoo lock then this nonsense goes away too.
> + */
> + if (IS_ENABLED(CONFIG_CONTEXT_TRACKING_USER) &&
> + was_running &&
> + cpumask_available(tick_nohz_full_mask)) {
Why is this stuff even invoked when the mask is not available? If it's
not there then NOHZ full is not functional, period.
> + cpumask_andnot(added, nohz_full, tick_nohz_full_mask);
> + cpumask_andnot(removed, tick_nohz_full_mask, nohz_full);
> + for_each_cpu(cpu, added)
> + ct_cpu_track_user(cpu);
> + for_each_cpu(cpu, removed)
> + ct_cpu_untrack_user(cpu);
> + }
> +
> + /*
> + * Update tick_nohz_full_mask unconditionally: this is the snapshot
> + * read by the /sys/devices/system/cpu/nohz_full sysfs attribute and
> + * must reflect the current isolation set even in the DHM runtime case.
> + */
> + if (cpumask_available(tick_nohz_full_mask))
> + cpumask_copy(tick_nohz_full_mask, nohz_full);
Seriously?
> + /*
> + * Only modify tick_nohz_full_running and migrate the global tick when
> + * nohz_full= was set at boot; without boot-time setup, setting
> + * tick_nohz_full_running would suppress ticks on isolated CPUs and
> + * prevent RCU quiescent-state reporting, causing synchronize_rcu()
> + * to stall permanently when a CPU is subsequently offlined.
> + */
> + if (was_running) {
Again, why is any of this invoked when NOHZ full was never enabled and
initialized?
> + tick_nohz_full_running = !cpumask_empty(nohz_full);
Brilliant. When NOHZ full was enabled on the command line, then changing
the mask can disable "running" and that makes it disabled forever. There
is no way to reenable it.
This 'was_running' check is just wrong. What you need is a
'tick_nohz_full_initialized' boolean, which is only true when nohz_full
was set up early on including the mask.
If that's not the case, then none of this code is supposed to run
ever. I.e. the callback is not installed in the first place.
> + /*
> + * Ensure tick_nohz_full_mask is allocated so that tick_nohz_hk_apply()
> + * can update it (and the /sys/devices/system/cpu/nohz_full sysfs
> + * attribute) when CPUs are isolated at runtime via DHM. If "nohz_full="
> + * was passed at boot the mask is already allocated; allocate an empty
> + * one here for the runtime-only case.
What's the runtime only case? The fake exposure in sysfs which is just
misleading the user? Not going to happen. If it's not enabled on the
command line then it's disabled, end of story.
> + */
> + if (!cpumask_available(tick_nohz_full_mask) &&
> + !zalloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL))
> + pr_warn("tick/nohz: failed to allocate nohz_full_mask for DHM\n");
ROTFL. If the allocation fails, then the apply callback becomes a
complete noop doing magic cpumask operations for nothing and pretending
to be successful.
Thanks,
tglx
* Re: [PATCH v3 06/13] tick/nohz, context_tracking: Prepare for runtime nohz_full updates
2026-06-18 17:27 ` Thomas Gleixner
@ 2026-06-18 19:49 ` Thomas Gleixner
0 siblings, 0 replies; 17+ messages in thread
From: Thomas Gleixner @ 2026-06-18 19:49 UTC (permalink / raw)
To: Jing Wu, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
Shuah Khan
Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
Qiliang Yuan
On Thu, Jun 18 2026 at 19:27, Thomas Gleixner wrote:
> On Thu, Jun 18 2026 at 11:11, Jing Wu wrote:
>> Remove __init from ct_cpu_track_user() and __initdata from the
>> initialized flag so context tracking can be activated on CPUs that
>> join nohz_full at runtime. Drop the __ro_after_init attribute from
>> the context_tracking_key static key, allowing static_branch_dec()
>> when a CPU leaves nohz_full.
>>
>> Add ct_cpu_untrack_user() to reverse ct_cpu_track_user(), decrementing
>> the static key and clearing the per-CPU tracking state.
>
> Please do not enumerate WHAT the patch is doing. Explain the context and
> the WHY
>
> https://docs.kernel.org/process/maintainer-tip.html#changelog
Just for the record. I told your colleague the same thing already....
Thread overview: 17+ messages
2026-06-18 3:11 [PATCH v3 00/13] Dynamic Housekeeping Management (DHM) via CPUSets Jing Wu
2026-06-18 3:11 ` [PATCH v3 01/13] sched/isolation: Replace notifier chain with explicit callback interface Jing Wu
2026-06-18 3:11 ` [PATCH v3 02/13] sched/isolation: Add housekeeping_update_types() for kernel-noise masks Jing Wu
2026-06-18 3:11 ` [PATCH v3 03/13] sched/isolation: RCU-protect all housekeeping cpumask readers Jing Wu
2026-06-18 3:11 ` [PATCH v3 04/13] sched/isolation: Fix RCU protection for runtime-mutable cpumask callers Jing Wu
2026-06-18 3:11 ` [PATCH v3 05/13] cpu/hotplug: Reserve CPUHP states for nohz_full and managed IRQ down-paths Jing Wu
2026-06-18 16:06 ` Thomas Gleixner
2026-06-18 3:11 ` [PATCH v3 06/13] tick/nohz, context_tracking: Prepare for runtime nohz_full updates Jing Wu
2026-06-18 17:27 ` Thomas Gleixner
2026-06-18 19:49 ` Thomas Gleixner
2026-06-18 3:11 ` [PATCH v3 07/13] rcu/nocb: Add explicit housekeeping callback for runtime NOCB toggling Jing Wu
2026-06-18 3:11 ` [PATCH v3 08/13] genirq: Add explicit housekeeping callback for managed IRQ migration Jing Wu
2026-06-18 3:11 ` [PATCH v3 09/13] watchdog/lockup_detector: Register housekeeping callback for kernel-noise Jing Wu
2026-06-18 3:11 ` [PATCH v3 10/13] sched: Guard sched_tick_start/stop against uninitialized tick_work_cpu Jing Wu
2026-06-18 3:11 ` [PATCH v3 11/13] cgroup/cpuset: Extend isolated partition to trigger kernel-noise isolation Jing Wu
2026-06-18 3:11 ` [PATCH v3 12/13] docs: cgroup-v2: Document kernel-noise isolation via isolated partitions Jing Wu
2026-06-18 3:11 ` [PATCH v3 13/13] selftests/cgroup: Add kernel-noise isolation test to cpuset selftest Jing Wu