* [PATCH v3 04/13] sched/isolation: Fix RCU protection for runtime-mutable cpumask callers
From: Jing Wu @ 2026-06-18 3:11 UTC
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
Shuah Khan, Thomas Gleixner
Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
Qiliang Yuan
In-Reply-To: <20260618-wujing-dhm-v3-0-28f1a4d83b68@gmail.com>
housekeeping_update_types() installs new cpumasks via rcu_assign_pointer()
and frees the old ones after synchronize_rcu(); callers that dereference
the old pointer without holding an RCU read lock can access freed memory.
Fix the four call sites:

  kernel/sched/core.c (get_nohz_timer_target, HK_TYPE_KERNEL_NOISE):
    The guard(rcu)() was acquired after housekeeping_cpumask(). Move it
    before the call and switch to housekeeping_cpumask_rcu() so hk_mask
    is read inside the RCU read-side critical section. HK_TYPE_KERNEL_NOISE
    is updated at runtime by housekeeping_update_types(); this fix is
    required for correctness.

  drivers/hv/channel_mgmt.c (init_vp_index, HK_TYPE_MANAGED_IRQ):
    The function stored the raw pointer in a local variable and used it
    across GFP_KERNEL allocations (which can sleep, so an RCU read lock
    cannot span them). Allocate both cpumask_var_t buffers first, then
    snapshot the housekeeping mask under a brief rcu_read_lock() and use
    the snapshot throughout. HK_TYPE_MANAGED_IRQ is updated at runtime;
    this fix is required for correctness.

  kernel/time/hrtimer.c (get_target_base, HK_TYPE_TIMER):
    cpumask_any_and() against housekeeping_cpumask(HK_TYPE_TIMER) was
    called without any lock. Wrap with rcu_read_lock()/rcu_read_unlock()
    and use housekeeping_cpumask_rcu(). HK_TYPE_TIMER is not changed at
    runtime in this series; this is a defensive fix to satisfy the
    housekeeping_dereference_check() lockdep annotation for future-proofing.
    hrtimers_cpu_dying() is already safe: it runs under the cpu_hotplug_lock
    write side, which housekeeping_dereference_check() already permits.

  arch/arm64/kernel/topology.c (arch_freq_get_on_cpu, HK_TYPE_TICK):
    cpumask_intersects() against housekeeping_cpumask(HK_TYPE_TICK) was
    called without any lock. Evaluate under rcu_read_lock() and store
    the boolean result before releasing the lock. HK_TYPE_TICK is not
    changed at runtime in this series; this is a defensive fix.

Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
arch/arm64/kernel/topology.c | 9 ++++++--
drivers/hv/channel_mgmt.c | 50 ++++++++++++++++++++++++++++++--------------
kernel/sched/core.c | 3 +--
kernel/time/hrtimer.c | 5 ++++-
4 files changed, 46 insertions(+), 21 deletions(-)
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index b32f13358fbb1..8f4329b57cea7 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -212,8 +212,13 @@ int arch_freq_get_on_cpu(int cpu)
if (!policy)
return -EINVAL;
- if (!cpumask_intersects(policy->related_cpus,
- housekeeping_cpumask(HK_TYPE_TICK))) {
+ bool no_hk_in_policy;
+
+ rcu_read_lock();
+ no_hk_in_policy = !cpumask_intersects(policy->related_cpus,
+ housekeeping_cpumask_rcu(HK_TYPE_TICK));
+ rcu_read_unlock();
+ if (no_hk_in_policy) {
cpufreq_cpu_put(policy);
return -EOPNOTSUPP;
}
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 84eb0a6a0b546..fc5247e92e1b3 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -750,26 +750,43 @@ static void init_vp_index(struct vmbus_channel *channel)
{
bool perf_chn = hv_is_perf_channel(channel);
u32 i, ncpu = num_online_cpus();
- cpumask_var_t available_mask;
+ cpumask_var_t available_mask, hk_snap;
struct cpumask *allocated_mask;
- const struct cpumask *hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
u32 target_cpu;
int numa_node;
- if (!perf_chn ||
- !alloc_cpumask_var(&available_mask, GFP_KERNEL) ||
- cpumask_empty(hk_mask)) {
- /*
- * If the channel is not a performance critical
- * channel, bind it to VMBUS_CONNECT_CPU.
- * In case alloc_cpumask_var() fails, bind it to
- * VMBUS_CONNECT_CPU.
- * If all the cpus are isolated, bind it to
- * VMBUS_CONNECT_CPU.
- */
+ if (!perf_chn) {
+ channel->target_cpu = VMBUS_CONNECT_CPU;
+ return;
+ }
+
+ if (!alloc_cpumask_var(&available_mask, GFP_KERNEL)) {
+ channel->target_cpu = VMBUS_CONNECT_CPU;
+ hv_set_allocated_cpu(VMBUS_CONNECT_CPU);
+ return;
+ }
+
+ /*
+ * Snapshot HK_TYPE_MANAGED_IRQ cpumask under RCU read lock.
+ * housekeeping_update_types() frees the old cpumask after
+ * synchronize_rcu(), so we must not hold the pointer beyond an
+ * RCU read-side critical section.
+ */
+ if (!alloc_cpumask_var(&hk_snap, GFP_KERNEL)) {
+ free_cpumask_var(available_mask);
+ channel->target_cpu = VMBUS_CONNECT_CPU;
+ hv_set_allocated_cpu(VMBUS_CONNECT_CPU);
+ return;
+ }
+ rcu_read_lock();
+ cpumask_copy(hk_snap, housekeeping_cpumask_rcu(HK_TYPE_MANAGED_IRQ));
+ rcu_read_unlock();
+
+ if (cpumask_empty(hk_snap)) {
+ free_cpumask_var(hk_snap);
+ free_cpumask_var(available_mask);
channel->target_cpu = VMBUS_CONNECT_CPU;
- if (perf_chn)
- hv_set_allocated_cpu(VMBUS_CONNECT_CPU);
+ hv_set_allocated_cpu(VMBUS_CONNECT_CPU);
return;
}
@@ -788,7 +805,7 @@ static void init_vp_index(struct vmbus_channel *channel)
retry:
cpumask_xor(available_mask, allocated_mask, cpumask_of_node(numa_node));
- cpumask_and(available_mask, available_mask, hk_mask);
+ cpumask_and(available_mask, available_mask, hk_snap);
if (cpumask_empty(available_mask)) {
/*
@@ -809,6 +826,7 @@ static void init_vp_index(struct vmbus_channel *channel)
channel->target_cpu = target_cpu;
+ free_cpumask_var(hk_snap);
free_cpumask_var(available_mask);
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b8871449d3c69..371b509d92164 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1272,9 +1272,8 @@ int get_nohz_timer_target(void)
default_cpu = cpu;
}
- hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
-
guard(rcu)();
+ hk_mask = housekeeping_cpumask_rcu(HK_TYPE_KERNEL_NOISE);
for_each_domain(cpu, sd) {
for_each_cpu_and(i, sched_domain_span(sd), hk_mask) {
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 5bd6efe598f0f..18e17a9dad67b 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -242,8 +242,11 @@ static bool hrtimer_suitable_target(struct hrtimer *timer, struct hrtimer_clock_
static inline struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base, bool pinned)
{
if (!hrtimer_base_is_online(base)) {
- int cpu = cpumask_any_and(cpu_online_mask, housekeeping_cpumask(HK_TYPE_TIMER));
+ int cpu;
+ rcu_read_lock();
+ cpu = cpumask_any_and(cpu_online_mask, housekeeping_cpumask_rcu(HK_TYPE_TIMER));
+ rcu_read_unlock();
return &per_cpu(hrtimer_bases, cpu);
}
--
2.43.0
* [PATCH v3 03/13] sched/isolation: RCU-protect all housekeeping cpumask readers
From: Jing Wu @ 2026-06-18 3:11 UTC
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
Shuah Khan, Thomas Gleixner
Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
Qiliang Yuan
In-Reply-To: <20260618-wujing-dhm-v3-0-28f1a4d83b68@gmail.com>
Extend housekeeping_dereference_check() to validate all runtime-mutable
types (HK_TYPE_DOMAIN, HK_TYPE_KERNEL_NOISE, HK_TYPE_MANAGED_IRQ), not
only HK_TYPE_DOMAIN. Boot-only types (HK_TYPE_DOMAIN_BOOT) remain
unchecked.
Add housekeeping_cpumask_rcu() for callers that already hold an RCU
read lock. This variant uses plain rcu_dereference(), which only asserts
that an RCU read-side critical section is held, and skips the
housekeeping_dereference_check() condition, avoiding false-positive
lockdep warnings in RCU read-side critical sections.
Use READ_ONCE() consistently when testing housekeeping.flags in paths
that may race with housekeeping_update_types().
Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
include/linux/sched/isolation.h | 6 +++++
kernel/sched/isolation.c | 57 +++++++++++++++++++++++++++++++----------
2 files changed, 49 insertions(+), 14 deletions(-)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index eecbcbe802bd0..ed6e1c6980131 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -40,6 +40,7 @@ enum hk_type {
DECLARE_STATIC_KEY_FALSE(housekeeping_overridden);
extern int housekeeping_any_cpu(enum hk_type type);
extern const struct cpumask *housekeeping_cpumask(enum hk_type type);
+extern const struct cpumask *housekeeping_cpumask_rcu(enum hk_type type);
extern bool housekeeping_enabled(enum hk_type type);
extern void housekeeping_affine(struct task_struct *t, enum hk_type type);
extern bool housekeeping_test_cpu(int cpu, enum hk_type type);
@@ -87,6 +88,11 @@ static inline const struct cpumask *housekeeping_cpumask(enum hk_type type)
return cpu_possible_mask;
}
+static inline const struct cpumask *housekeeping_cpumask_rcu(enum hk_type type)
+{
+ return cpu_possible_mask;
+}
+
static inline bool housekeeping_enabled(enum hk_type type)
{
return false;
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 4eca18cc5e8ce..3d5d3f12853c7 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -121,25 +121,40 @@ bool housekeeping_enabled(enum hk_type type)
}
EXPORT_SYMBOL_GPL(housekeeping_enabled);
+/*
+ * Types that can change at runtime via cpuset isolated partitions.
+ * Boot-only types (DOMAIN_BOOT) are always safe to read without lockdep.
+ */
+static bool housekeeping_type_can_change(enum hk_type type)
+{
+ switch (type) {
+ case HK_TYPE_DOMAIN:
+ case HK_TYPE_KERNEL_NOISE:
+ case HK_TYPE_MANAGED_IRQ:
+ return true;
+ default:
+ return false;
+ }
+}
+
static bool housekeeping_dereference_check(enum hk_type type)
{
- if (IS_ENABLED(CONFIG_LOCKDEP) && type == HK_TYPE_DOMAIN) {
- /* Cpuset isn't even writable yet? */
- if (system_state <= SYSTEM_SCHEDULING)
- return true;
+ if (!IS_ENABLED(CONFIG_LOCKDEP) || !housekeeping_type_can_change(type))
+ return true;
- /* CPU hotplug write locked, so cpuset partition can't be overwritten */
- if (IS_ENABLED(CONFIG_HOTPLUG_CPU) && lockdep_is_cpus_write_held())
- return true;
+ /* Cpuset isn't even writable yet? */
+ if (system_state <= SYSTEM_SCHEDULING)
+ return true;
- /* Cpuset lock held, partitions not writable */
- if (IS_ENABLED(CONFIG_CPUSETS) && lockdep_is_cpuset_held())
- return true;
+ /* CPU hotplug write locked, so cpuset partition can't be overwritten */
+ if (IS_ENABLED(CONFIG_HOTPLUG_CPU) && lockdep_is_cpus_write_held())
+ return true;
- return false;
- }
+ /* Cpuset lock held, partitions not writable */
+ if (IS_ENABLED(CONFIG_CPUSETS) && lockdep_is_cpuset_held())
+ return true;
- return true;
+ return false;
}
static inline struct cpumask *housekeeping_cpumask_dereference(enum hk_type type)
@@ -162,12 +177,26 @@ const struct cpumask *housekeeping_cpumask(enum hk_type type)
}
EXPORT_SYMBOL_GPL(housekeeping_cpumask);
+const struct cpumask *housekeeping_cpumask_rcu(enum hk_type type)
+{
+ const struct cpumask *mask = NULL;
+
+ if (static_branch_unlikely(&housekeeping_overridden)) {
+ if (READ_ONCE(housekeeping.flags) & BIT(type))
+ mask = rcu_dereference(housekeeping.cpumasks[type]);
+ }
+ if (!mask)
+ mask = cpu_possible_mask;
+ return mask;
+}
+EXPORT_SYMBOL_GPL(housekeeping_cpumask_rcu);
+
int housekeeping_any_cpu(enum hk_type type)
{
int cpu;
if (static_branch_unlikely(&housekeeping_overridden)) {
- if (housekeeping.flags & BIT(type)) {
+ if (READ_ONCE(housekeeping.flags) & BIT(type)) {
cpu = sched_numa_find_closest(housekeeping_cpumask(type), smp_processor_id());
if (cpu < nr_cpu_ids)
return cpu;
--
2.43.0
* [PATCH v3 02/13] sched/isolation: Add housekeeping_update_types() for kernel-noise masks
From: Jing Wu @ 2026-06-18 3:11 UTC
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
Shuah Khan, Thomas Gleixner
Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
Qiliang Yuan
In-Reply-To: <20260618-wujing-dhm-v3-0-28f1a4d83b68@gmail.com>
Introduce housekeeping_update_types(), which updates the cpumask for
each specified housekeeping type atomically using an RCU pointer swap.
For each type in @type_mask the trial mask is computed as
(base & ~isol_mask), where the base depends on the type:
- Most types use the current housekeeping cpumask as base. For
types that are only set at boot this is equivalent to the boot
mask, so trial = (boot_mask & ~isol_mask).
- HK_TYPE_KERNEL_NOISE always uses cpu_possible_mask as base. Its
semantics are "all possible CPUs minus the currently-isolated set";
using the current HK mask instead would leave it stuck at its last
non-trivial value after de-isolation, breaking subsequent isolation
cycles.
HK_TYPE_KERNEL_NOISE also supports runtime first-enable: if it was not
registered at boot (no nohz_full= on the kernel command line),
housekeeping_update_types() registers it in housekeeping.flags on the
first call. All other types must already be boot-enabled.
For each type the function validates the trial mask against
cpu_online_mask, runs registered pre_validate() callbacks (which may
reject the update), swaps all RCU cpumask pointers in a single pass,
calls synchronize_rcu(), frees the old masks, and then runs apply()
callbacks.
The existing housekeeping_update() continues to update only
HK_TYPE_DOMAIN and remains the entry point for the cpuset partition
path. housekeeping_update_types() enables the partition path to also
drive the kernel-noise types (HK_TYPE_KERNEL_NOISE,
HK_TYPE_MANAGED_IRQ) through the explicit callback interface added in
the previous patch.
Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
include/linux/sched/isolation.h | 4 ++
kernel/sched/isolation.c | 112 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 116 insertions(+)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index f362876b3ebdf..eecbcbe802bd0 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -44,6 +44,8 @@ extern bool housekeeping_enabled(enum hk_type type);
extern void housekeeping_affine(struct task_struct *t, enum hk_type type);
extern bool housekeeping_test_cpu(int cpu, enum hk_type type);
extern int housekeeping_update(struct cpumask *isol_mask);
+extern int housekeeping_update_types(unsigned long type_mask,
+ struct cpumask *isol_mask);
extern void __init housekeeping_init(void);
/**
@@ -99,6 +101,8 @@ static inline bool housekeeping_test_cpu(int cpu, enum hk_type type)
}
static inline int housekeeping_update(struct cpumask *isol_mask) { return 0; }
+static inline int housekeeping_update_types(unsigned long type_mask,
+ struct cpumask *isol_mask) { return 0; }
static inline void housekeeping_init(void) { }
static inline int housekeeping_register_cbs(enum hk_type type,
struct housekeeping_cbs *cbs) { return 0; }
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index aae4dff7fbfc8..4eca18cc5e8ce 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -249,6 +249,118 @@ int housekeeping_update(struct cpumask *isol_mask)
return 0;
}
+/**
+ * housekeeping_update_types - Update housekeeping masks for specified types
+ * @type_mask: Bitmask of housekeeping types to update
+ * @isol_mask: CPUs being added to the isolation set
+ *
+ * For each type in @type_mask that was enabled at boot, compute the
+ * trial mask as (boot mask & ~@isol_mask), validate it against
+ * cpu_online_mask, invoke pre_validate() callbacks, swap the RCU
+ * mask pointer, and run apply() callbacks after synchronize_rcu().
+ *
+ * HK_TYPE_KERNEL_NOISE also supports runtime first-enable: when an
+ * isolated cpuset partition is created without nohz_full= at boot,
+ * cpu_possible_mask is used as the initial base and the type flag is
+ * set in housekeeping.flags on the first call.
+ *
+ * Return: 0 on success, -ENOMEM on allocation failure, -EINVAL if
+ * a trial mask has no online CPUs.
+ */
+int housekeeping_update_types(unsigned long type_mask,
+ struct cpumask *isol_mask)
+{
+ struct cpumask *trials[HK_TYPE_MAX] = {};
+ struct cpumask *old_masks[HK_TYPE_MAX] = {};
+ enum hk_type type;
+ int ret = 0;
+
+ for_each_set_bit(type, &type_mask, HK_TYPE_MAX) {
+ const struct cpumask *base;
+
+ if (type == HK_TYPE_DOMAIN_BOOT)
+ continue;
+ if (!housekeeping_enabled(type)) {
+ /*
+ * HK_TYPE_KERNEL_NOISE supports runtime first-enable
+ * for DHM isolated partitions created without nohz_full=
+ * at boot. All other types must be boot-enabled.
+ */
+ if (type != HK_TYPE_KERNEL_NOISE)
+ continue;
+ }
+
+ /*
+ * HK_TYPE_KERNEL_NOISE always uses cpu_possible_mask as its
+ * base. Its semantics are exactly "cpu_possible minus the
+ * currently-isolated set", so the base never shrinks across
+ * successive isolation/de-isolation cycles. If we used the
+ * current HK mask instead, de-isolating all partitions would
+ * leave the mask at its last non-trivial value rather than
+ * reverting to cpu_possible, breaking subsequent isolations.
+ */
+ if (type == HK_TYPE_KERNEL_NOISE)
+ base = cpu_possible_mask;
+ else
+ base = housekeeping_cpumask(type);
+ trials[type] = kmalloc(cpumask_size(), GFP_KERNEL);
+ if (!trials[type]) {
+ ret = -ENOMEM;
+ goto err_free;
+ }
+ cpumask_andnot(trials[type], base, isol_mask);
+ if (!cpumask_intersects(trials[type], cpu_online_mask)) {
+ ret = -EINVAL;
+ goto err_free;
+ }
+ }
+
+ if (!housekeeping.flags) {
+ ret = -EINVAL;
+ goto err_free;
+ }
+
+ for_each_set_bit(type, &type_mask, HK_TYPE_MAX) {
+ if (!trials[type])
+ continue;
+ ret = housekeeping_pre_validate_cbs(type,
+ housekeeping_cpumask(type),
+ trials[type]);
+ if (ret < 0)
+ goto err_free;
+ }
+
+ for_each_set_bit(type, &type_mask, HK_TYPE_MAX) {
+ if (!trials[type])
+ continue;
+ old_masks[type] = housekeeping_cpumask_dereference(type);
+ /* First-time runtime enable: register the type now. */
+ if (!housekeeping_enabled(type))
+ WRITE_ONCE(housekeeping.flags,
+ housekeeping.flags | BIT(type));
+ rcu_assign_pointer(housekeeping.cpumasks[type], trials[type]);
+ trials[type] = NULL;
+ }
+
+ synchronize_rcu();
+
+ for_each_set_bit(type, &type_mask, HK_TYPE_MAX) {
+ if (housekeeping_cbs_table[type].nr == 0)
+ continue;
+ housekeeping_apply_cbs(type);
+ }
+
+ for_each_set_bit(type, &type_mask, HK_TYPE_MAX)
+ kfree(old_masks[type]);
+
+ return 0;
+
+err_free:
+ for_each_set_bit(type, &type_mask, HK_TYPE_MAX)
+ kfree(trials[type]);
+ return ret;
+}
+
void __init housekeeping_init(void)
{
enum hk_type type;
--
2.43.0
* [PATCH v3 01/13] sched/isolation: Replace notifier chain with explicit callback interface
From: Jing Wu @ 2026-06-18 3:11 UTC
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
Shuah Khan, Thomas Gleixner
Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
Qiliang Yuan
In-Reply-To: <20260618-wujing-dhm-v3-0-28f1a4d83b68@gmail.com>
Replace the blocking notifier chain with an explicit per-type callback
table (struct housekeeping_cbs). Each subsystem registers callbacks
at initcall time; pre_validate() runs before the RCU pointer swap to
allow rejecting the update, and apply() runs after synchronize_rcu()
when the new mask is visible to readers.
The table is limited to HK_MAX_CBS (4) slots per type, sufficient for
the kernel-noise subsystems and avoiding unbounded dynamic allocation
in the update path. The interface provides deterministic callback
order and explicit registration, giving each subsystem maintainer clear
visibility into when and why its callback is invoked — unlike the
opaque priority-based dispatch of notifier chains.
Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
include/linux/sched/isolation.h | 31 +++++++++++++++
kernel/sched/isolation.c | 87 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 118 insertions(+)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index cf0fd03dd7a24..f362876b3ebdf 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -46,6 +46,33 @@ extern bool housekeeping_test_cpu(int cpu, enum hk_type type);
extern int housekeeping_update(struct cpumask *isol_mask);
extern void __init housekeeping_init(void);
+/**
+ * struct housekeeping_cbs - Per-subsystem callbacks for housekeeping mask changes
+ * @name: Subsystem name for diagnostic messages
+ * @pre_validate: Run before RCU pointer swap. Return -EINVAL
+ * to reject the update.
+ * @apply: Run after synchronize_rcu(). Reconfigure subsystem
+ * state. The new mask is visible to readers.
+ *
+ * Subsystems register their callbacks at initcall time.
+ * Callbacks are invoked in registration order when the corresponding
+ * housekeeping mask changes. Types not present in the update mask
+ * are skipped.
+ *
+ * This replaces the notifier-chain pattern with deterministic callback
+ * ordering.
+ */
+struct housekeeping_cbs {
+ const char *name;
+ int (*pre_validate)(enum hk_type type,
+ const struct cpumask *cur_mask,
+ const struct cpumask *new_mask);
+ void (*apply)(enum hk_type type);
+};
+
+int housekeeping_register_cbs(enum hk_type type, struct housekeeping_cbs *cbs);
+int housekeeping_unregister_cbs(enum hk_type type, struct housekeeping_cbs *cbs);
+
#else
static inline int housekeeping_any_cpu(enum hk_type type)
@@ -73,6 +100,10 @@ static inline bool housekeeping_test_cpu(int cpu, enum hk_type type)
static inline int housekeeping_update(struct cpumask *isol_mask) { return 0; }
static inline void housekeeping_init(void) { }
+static inline int housekeeping_register_cbs(enum hk_type type,
+ struct housekeeping_cbs *cbs) { return 0; }
+static inline int housekeeping_unregister_cbs(enum hk_type type,
+ struct housekeeping_cbs *cbs) { return 0; }
#endif /* CONFIG_CPU_ISOLATION */
static inline bool housekeeping_cpu(int cpu, enum hk_type type)
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index ef152d401fe20..aae4dff7fbfc8 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -28,6 +28,93 @@ struct housekeeping {
static struct housekeeping housekeeping;
+/*
+ * Explicit callback table indexed by housekeeping type.
+ * Callbacks for affected types are invoked in deterministic order:
+ * pre_validate() before the RCU pointer swap, apply() after
+ * synchronize_rcu().
+ */
+#define HK_MAX_CBS 4
+
+static struct {
+ struct housekeeping_cbs *cbs[HK_MAX_CBS];
+ int nr;
+} housekeeping_cbs_table[HK_TYPE_MAX];
+
+/**
+ * housekeeping_register_cbs - Register explicit callbacks for a housekeeping type
+ * @type: Housekeeping type to register for
+ * @cbs: Callback structure containing pre_validate() and apply()
+ *
+ * Callbacks run in registration order when the mask for @type changes:
+ * pre_validate() before the RCU swap may reject the update; apply()
+ * after synchronize_rcu() reconfigures subsystem state.
+ *
+ * Return: 0 on success, -EINVAL if @type or @cbs is invalid,
+ * -ENOSPC if the per-type table is full.
+ */
+int housekeeping_register_cbs(enum hk_type type, struct housekeeping_cbs *cbs)
+{
+ if (type >= HK_TYPE_MAX || !cbs)
+ return -EINVAL;
+ if (housekeeping_cbs_table[type].nr >= HK_MAX_CBS)
+ return -ENOSPC;
+ housekeeping_cbs_table[type].cbs[housekeeping_cbs_table[type].nr++] = cbs;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(housekeeping_register_cbs);
+
+/**
+ * housekeeping_unregister_cbs - Remove previously registered callbacks
+ * @type: Housekeeping type
+ * @cbs: Callback structure to remove
+ *
+ * Return: 0 on success, -EINVAL if arguments are invalid,
+ * -ENOENT if @cbs was not registered.
+ */
+int housekeeping_unregister_cbs(enum hk_type type, struct housekeeping_cbs *cbs)
+{
+ int i;
+
+ if (type >= HK_TYPE_MAX || !cbs)
+ return -EINVAL;
+ for (i = 0; i < housekeeping_cbs_table[type].nr; i++) {
+ if (housekeeping_cbs_table[type].cbs[i] == cbs) {
+ housekeeping_cbs_table[type].cbs[i] =
+ housekeeping_cbs_table[type].cbs[--housekeeping_cbs_table[type].nr];
+ return 0;
+ }
+ }
+ return -ENOENT;
+}
+EXPORT_SYMBOL_GPL(housekeeping_unregister_cbs);
+
+static int housekeeping_pre_validate_cbs(enum hk_type type,
+ const struct cpumask *cur,
+ const struct cpumask *new)
+{
+ int i, ret;
+
+ for (i = 0; i < housekeeping_cbs_table[type].nr; i++) {
+ if (!housekeeping_cbs_table[type].cbs[i]->pre_validate)
+ continue;
+ ret = housekeeping_cbs_table[type].cbs[i]->pre_validate(type, cur, new);
+ if (ret < 0)
+ return ret;
+ }
+ return 0;
+}
+
+static void housekeeping_apply_cbs(enum hk_type type)
+{
+ int i;
+
+ for (i = 0; i < housekeeping_cbs_table[type].nr; i++) {
+ if (housekeeping_cbs_table[type].cbs[i]->apply)
+ housekeeping_cbs_table[type].cbs[i]->apply(type);
+ }
+}
+
bool housekeeping_enabled(enum hk_type type)
{
return !!(READ_ONCE(housekeeping.flags) & BIT(type));
--
2.43.0
* [PATCH v3 00/13] Dynamic Housekeeping Management (DHM) via CPUSets
From: Jing Wu @ 2026-06-18 3:11 UTC
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
Shuah Khan, Thomas Gleixner
Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
Qiliang Yuan
This series introduces Dynamic Housekeeping Management (DHM) to the Linux
kernel, enabling runtime reconfiguration of kernel-noise housekeeping
(nohz_full tick suppression, RCU NOCB offloading, and managed IRQ
migration) through the existing cgroup v2 cpuset isolated partition
mechanism — no new kernel ABI required.
When a cpuset partition is set to isolated mode, the CPUs in that
partition are removed from the kernel's global housekeeping masks. The
housekeeping subsystems (tick/nohz, RCU NOCB, genirq) react via explicit
registered callbacks, applying the new masks at runtime. Destroying the
partition restores the CPUs to all housekeeping masks.
The architecture uses a per-type callback table (struct housekeeping_cbs)
with pre_validate/apply hooks, replacing the previous notifier chain.
Housekeeping cpumask pointers are RCU-protected to allow lock-free readers
during updates.
Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
V2 -> V3:
- Replace notifier chain with explicit per-type callback interface
(struct housekeeping_cbs with .name, .pre_validate, .apply fields).
- RCU-protect all housekeeping cpumask pointers; callers must hold
rcu_read_lock() or use housekeeping_cpumask_rcu() in apply() callbacks.
- Drop 5 patches from v2: HK_TYPE enum separation (upstream aliases are
already correct), no-op timer/hrtimer patches, kthread dead code, and
workqueue double-update.
- Fix deadlock in rcu_hk_workfn(): remove cpus_read_lock() wrapper around
remove_cpu()/add_cpu() which take cpu_hotplug_lock write side.
- Fix UAF in rcu_hk_apply(): snapshot the housekeeping cpumask inside the
work function under rcu_read_lock(), not at apply() time where the old
pointer may be freed by synchronize_rcu() before the work runs.
- Fix tick apply(): snapshot housekeeping_cpumask_rcu() under
rcu_read_lock() as required by lockdep for runtime-mutable types.
- Activate context_tracking dynamically via ct_cpu_track_user() /
ct_cpu_untrack_user() in tick apply(), eliminating the dependency on
CONFIG_CONTEXT_TRACKING_USER_FORCE flagged by tglx.
- Fix genirq apply(): snapshot HK_TYPE_MANAGED_IRQ mask under
rcu_read_lock() before the IRQ iteration loop.
- Simplify cpuset noise_types to BIT(HK_TYPE_KERNEL_NOISE) |
BIT(HK_TYPE_MANAGED_IRQ), replacing the redundant per-alias bitmask.
- housekeeping_update_types(): always use cpu_possible_mask as base
for HK_TYPE_KERNEL_NOISE, so de-isolation restores the mask to all
possible CPUs rather than leaving it at its last non-trivial value.
- Initialize watchdog_cpumask from HK_TYPE_KERNEL_NOISE (not
HK_TYPE_TIMER) at boot; keep it in sync at runtime via a new
housekeeping_cbs callback.
- Add kernel-noise selftest to test_cpuset_prs.sh, including
cpu_in_cpulist() for correct cpulist range membership detection and
nohz_full sysfs verification when CONFIG_NO_HZ_FULL is active.
- Add RCU caller fixes: sched/core (HK_TYPE_KERNEL_NOISE) and
drivers/hv (HK_TYPE_MANAGED_IRQ) are required because those types
are updated at runtime; hrtimer (HK_TYPE_TIMER) and arm64/topology
(HK_TYPE_TICK) are defensive fixes.
- Reorder patches so all subsystem callbacks are registered before the
cpuset patch that triggers housekeeping_update_types().
V1 -> V2:
- Rebrand series from DHEI to DHM (Dynamic Housekeeping Management).
- Drop custom sysfs interface entirely.
- Integrate housekeeping control into cgroup v2 cpuset isolated partition
mechanism.
- Add SMT-aware isolation constraints to prevent splitting SMT siblings.
- Add comprehensive documentation and cgroup functional selftests.
- Refactor mask transition logic to use RCU-safe handover.
v2: https://lore.kernel.org/r/20260413-wujing-dhm-v2-0-06df21caba5d@gmail.com
v1: https://lore.kernel.org/all/20260325-dhei-v12-final-v1-0-919cca23cadf@gmail.com
---
Jing Wu (13):
sched/isolation: Replace notifier chain with explicit callback interface
sched/isolation: Add housekeeping_update_types() for kernel-noise masks
sched/isolation: RCU-protect all housekeeping cpumask readers
sched/isolation: Fix RCU protection for runtime-mutable cpumask callers
cpu/hotplug: Reserve CPUHP states for nohz_full and managed IRQ down-paths
tick/nohz, context_tracking: Prepare for runtime nohz_full updates
rcu/nocb: Add explicit housekeeping callback for runtime NOCB toggling
genirq: Add explicit housekeeping callback for managed IRQ migration
watchdog/lockup_detector: Register housekeeping callback for kernel-noise
sched: Guard sched_tick_start/stop against uninitialized tick_work_cpu
cgroup/cpuset: Extend isolated partition to trigger kernel-noise isolation
docs: cgroup-v2: Document kernel-noise isolation via isolated partitions
selftests/cgroup: Add kernel-noise isolation test to cpuset selftest
Documentation/admin-guide/cgroup-v2.rst | 8 +
arch/arm64/kernel/topology.c | 9 +-
drivers/hv/channel_mgmt.c | 50 +++--
include/linux/context_tracking.h | 1 +
include/linux/cpuhotplug.h | 2 +
include/linux/sched/isolation.h | 41 ++++
kernel/cgroup/cpuset.c | 23 +-
kernel/context_tracking.c | 23 +-
kernel/irq/manage.c | 86 ++++++++
kernel/rcu/tree.c | 104 +++++++++
kernel/sched/core.c | 7 +-
kernel/sched/isolation.c | 256 ++++++++++++++++++++--
kernel/time/hrtimer.c | 5 +-
kernel/time/tick-sched.c | 157 ++++++++++++-
kernel/watchdog.c | 56 ++++-
tools/testing/selftests/cgroup/test_cpuset_prs.sh | 204 ++++++++++++++++-
16 files changed, 968 insertions(+), 64 deletions(-)
---
base-commit: eb3f4b7426cfd2b79d65b7d37155480b32259a11
change-id: 20260408-wujing-dhm-8f43e2d49cd8
Best regards,
--
Jing Wu <realwujing@gmail.com>
^ permalink raw reply
* Re: [PATCH v8 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
From: YoungJun Park @ 2026-06-18 1:47 UTC (permalink / raw)
To: Nhat Pham
Cc: akpm, chrisl, linux-mm, cgroups, linux-kernel, kasong, hannes,
mhocko, roman.gushchin, shakeel.butt, muchun.song, shikemeng,
baoquan.he, baohua, yosry, gunho.lee, taejoon.song, hyungjun.cho,
mkoutny, baver.bae, matia.kim
In-Reply-To: <CAKEwX=NfSy0XiD_UMsDOHGCwpE7sYmBmhV4Y9vk_cbnnr6J6PQ@mail.gmail.com>
On Wed, Jun 17, 2026 at 01:50:49PM -0400, Nhat Pham wrote:
> On Wed, Jun 17, 2026 at 1:34 AM Youngjun Park <youngjun.park@lge.com> wrote:
> >
> > This is the v8 series of the swap tier patchset.
> >
> > Great thanks to Shakeel Butt and Yosry for the reviews and discussions [1].
> > The main change in this version is the interface change to use
> > memory.swap.tiers.max with '0' (disable) and 'max' (enable) values.
> > This mechanism was suggested by Shakeel and Yosry.
>
> I like this interface too :)
Good to hear. Now it looks like we have found a memcg interface that
aligns well with the existing memcg model.
I like this idea as well. Thanks again to Shakeel Butt and Yosry.
> > Here is a brief summary of our tentative conclusions. Please correct me
> > if anything is misrepresented (details in references):
> >
> > * Zswap tiering [2]:
> > Tiering applies only to the vswap + zswap combo. Zswap itself will
> > not be tiered, as the current architecture requires a physical device
> > for zswap allocation.
>
> I think Yosry wants zswap as a tier, right?
>
> Just that without vswap, maybe don't allow it to be a tier by itself?
With the current architecture, users cannot dynamically specify zswap as
a tier, and zswap is a separate layer, so it is not tiered by itself.
Once your vswap work lands, I think we can make zswap the default,
top-level tier.
After that, we can also look into cleaning up the zswap.writeback
interface together.
> > #2: Inter-tier promotion and demotion:
> > Promotion and demotion apply between tiers, not within a single
> > tier. The current interface defines only tier assignment; it does
> > not yet define when or how pages move between tiers. Two triggering
> > models are possible:
> >
> > (a) User-triggered: userspace explicitly initiates migration between
> > tiers (e.g. via a new interface or existing move_pages semantics).
> > (b) Kernel-triggered: the kernel moves pages between tiers at
> > appropriate points such as reclaim or refault.
>
> We'll likely need some kernel-triggered mechanism, or we'd have LRU inversion :)
>
> Cold pages will fill up fast tiers first, and more recent/warm pages
> will land on slow tiers...
Yeah, good point!
> We'll also need to enforce isolation/fairness to make sure no workload
> hoards the fast tiers too (but that probably requires demotion
> support).
Right, that makes sense.
BTW, one thing I am curious about is whether there are strong
real-world use cases that require demotion/promotion.
In theory this looks useful, but it would help to better understand
the requirements of such deployments.
> >
> > #3: Per-VMA, per-process swap and BPF:
> > Not limited to memcg-based swap: it could be extended to per-VMA or
> > per-process swap, or driven from a BPF program.
> >
> > #4: Zswap and vswap tiering:
> > Tiering applies to the vswap + zswap combination.
> >
> > #5: Vswap on/off control:
> > Currently not supported. If a strong use case arises where vswap needs
> > to be controlled by memcg, the tier interface could be used for it.
>
> +1.
>
> Also, per-si/per-tier per-CPU allocation caching? :) Kairui already
> has a patch for it, IIUC, but if not it's pretty critical I'd say.
Yes, I missed it. Thank you for bringing it up.
We need an implementation that integrates this with the per-CPU
allocation currently implemented on the vswap side.
If Kairui's patch lands, my patch #4 can also be optimized based on that.
> BTW, can we add some selftests, to make sure the new interface works
> as expected, and to have example programs for new users to model their
> scripts after? :)
Yes, I agree. I think selftests are necessary.
Do you want them to be introduced in this patchset, or would it be okay
to add them separately as follow-up work?
^ permalink raw reply
* Re: [PATCH v8 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
From: Nhat Pham @ 2026-06-17 17:50 UTC (permalink / raw)
To: Youngjun Park
Cc: akpm, chrisl, linux-mm, cgroups, linux-kernel, kasong, hannes,
mhocko, roman.gushchin, shakeel.butt, muchun.song, shikemeng,
baoquan.he, baohua, yosry, gunho.lee, taejoon.song, hyungjun.cho,
mkoutny, baver.bae, matia.kim
In-Reply-To: <20260617053447.2831896-1-youngjun.park@lge.com>
On Wed, Jun 17, 2026 at 1:34 AM Youngjun Park <youngjun.park@lge.com> wrote:
>
> This is the v8 series of the swap tier patchset.
>
> Great thanks to Shakeel Butt and Yosry for the reviews and discussions [1].
> The main change in this version is the interface change to use
> memory.swap.tiers.max with '0' (disable) and 'max' (enable) values.
> This mechanism was suggested by Shakeel and Yosry.
I like this interface too :)
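
For readers new to the thread, the two accepted values can be modelled in
userspace as below. This is only a sketch: parse_tier_max() and the enum
are hypothetical names invented for illustration, and the series' real
parser may accept or reject input differently.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical model of the '0' (disable) / 'max' (enable) values
 * described for memory.swap.tiers.max. Not the kernel's parser.
 */
enum tier_limit { TIER_DISABLED, TIER_ENABLED, TIER_INVALID };

static enum tier_limit parse_tier_max(const char *buf)
{
	size_t len;

	assert(buf != NULL);
	/* Ignore a trailing newline such as the one written by echo. */
	len = strcspn(buf, "\n");

	if (len == 1 && buf[0] == '0')
		return TIER_DISABLED;
	if (len == 3 && strncmp(buf, "max", 3) == 0)
		return TIER_ENABLED;
	/* Anything else is rejected in this sketch. */
	return TIER_INVALID;
}
```

Rejecting every other value keeps room for a later numeric or "auto"
mode without changing what '0' and 'max' mean.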
>
> This change allows for future extensions to control swap
> between tiers and aligns better with existing memcg interfaces.
> Even with this memcg interface change, only patch #3 needed updates.
> Internally, patch #3 still uses the existing mask processing method
> (which is implementation-efficient), so only the user-facing interface
> was modified.
>
> We also discussed tier extensions. Thanks to Yosry, Nhat and Shakeel for their
> valuable feedback.
>
> Here is a brief summary of our tentative conclusions. Please correct me
> if anything is misrepresented (details in references):
>
> * Zswap tiering [2]:
> Tiering applies only to the vswap + zswap combo. Zswap itself will
> not be tiered, as the current architecture requires a physical device
> for zswap allocation.
I think Yosry wants zswap as a tier, right?
Just that without vswap, maybe don't allow it to be a tier by itself?
> * Vswap tiering [3]:
> Vswap should be handled transparently to the user. Vswap itself will
> not be tiered, but it may be supported someday if there is a strong, real use case.
> * Relationship with zswap.writeback [4]:
> If zswap tiering is introduced, it could replace the zswap-only tier.
> However, since zswap cannot be tiered independently, it is still
> needed for non-vswap cases. Separately, the internal logic could
> potentially be integrated into the tiering logic.
> * Tier demotion [5]:
> A separate interface like memory.swap.tiers.demotion might be needed.
> For now, we only support 0/max to enable/disable tiers. In the future,
> we could introduce an "auto" mode to automatically scale the limit
> based on swapfile size and memory.swap.max, similar to the direction
> memory tiering is heading in.
>
> I plan to apply the swap tier infrastructure and the first use case
> (cgroup-based swap control) first, and continue following up on the
> discussions above.
>
> Overview
> ========
>
> Swap Tiers group swap devices into performance classes (e.g. NVMe,
> HDD, Network) and allow per-memcg selection of which tiers to use.
> This mechanism was suggested by Chris Li.
>
>
> #2: Inter-tier promotion and demotion:
> Promotion and demotion apply between tiers, not within a single
> tier. The current interface defines only tier assignment; it does
> not yet define when or how pages move between tiers. Two triggering
> models are possible:
>
> (a) User-triggered: userspace explicitly initiates migration between
> tiers (e.g. via a new interface or existing move_pages semantics).
> (b) Kernel-triggered: the kernel moves pages between tiers at
> appropriate points such as reclaim or refault.
We'll likely need some kernel-triggered mechanism, or we'd have LRU inversion :)
Cold pages will fill up fast tiers first, and more recent/warm pages
will land on slow tiers...
We'll also need to enforce isolation/fairness to make sure no workload
hoards the fast tiers too (but that probably requires demotion
support).
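
The LRU inversion concern can be illustrated with a toy userspace model.
It assumes (this is not taken from the series) that pages are swapped
out coldest first and that each page goes to the fast tier while it has
free slots; place_without_demotion() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

enum tier { TIER_FAST, TIER_SLOW };

/*
 * Toy model: index i is the i-th page evicted, so lower indexes are
 * colder pages. With no demotion, the fast tier is consumed in
 * eviction order and never freed up for warmer pages.
 */
static void place_without_demotion(size_t npages, size_t fast_slots,
				   enum tier *out)
{
	size_t used = 0;

	assert(out != NULL || npages == 0);
	for (size_t i = 0; i < npages; i++) {
		if (used < fast_slots) {
			out[i] = TIER_FAST;	/* coldest pages land here */
			used++;
		} else {
			out[i] = TIER_SLOW;	/* warmer pages spill over */
		}
	}
}
```

In this model the warmer pages, which are the most likely to refault,
end up on the slow tier, which is the inversion a kernel-triggered
demotion path would have to undo.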
>
> #3: Per-VMA, per-process swap and BPF:
> Not limited to memcg-based swap: it could be extended to per-VMA or
> per-process swap, or driven from a BPF program.
>
> #4: Zswap and vswap tiering:
> Tiering applies to the vswap + zswap combination.
>
> #5: Vswap on/off control:
> Currently not supported. If a strong use case arises where vswap needs
> to be controlled by memcg, the tier interface could be used for it.
+1.
Also, per-si/per-tier per-CPU allocation caching? :) Kairui already
has a patch for it, IIUC, but if not it's pretty critical I'd say.
BTW, can we add some selftests, to make sure the new interface works
as expected, and to have example programs for new users to model their
scripts after? :)
^ permalink raw reply
* Re: [swap tier discussion] Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
From: Nhat Pham @ 2026-06-17 17:11 UTC (permalink / raw)
To: Yosry Ahmed
Cc: YoungJun Park, Shakeel Butt, Hao Jia, Johannes Weiner, mhocko, tj,
mkoutny, roman.gushchin, akpm, chengming.zhou, muchun.song,
cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia, chrisl,
kasong, baoquan.he, joshua.hahnjy
In-Reply-To: <CAO9r8zOg0OP1Ak1v7CRzSfQq0D8b4Dw+_T0Jui6YTM_KwQQNOA@mail.gmail.com>
On Tue, Jun 16, 2026 at 4:27 PM Yosry Ahmed <yosry@kernel.org> wrote:
>
> On Tue, Jun 16, 2026 at 1:24 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Ohh I thought you meant we shouldn't allow zswap to be a tier at all,
> not the *only* tier.
>
> > Or are you suggesting that if we set zswap as the only tier then we
> > can allocate from any swapfile (since we're not doing any IO anyway)?
>
> Hmm, technically having zswap as the only tier should be equivalent to
> disabling writeback, but you're right that if zswap is the only tier
> then the memcg is not allowed to use swap slots from any swapfile, so
> zswap cannot be used. Very good point :)
Yeah the coupling of swap/zswap makes reasoning about these kinds of
things so annoying. :)
If anything, with vswap, I'll stop having to explain to folks why they
have to provision on-disk swapfile when they only want to use
in-memory compressed swap, and that's a win in my book.
>
> In this case I think yes, we need vswap to be enabled to allow making
> zswap the only tier. That's one gap between zswap being the only tier
> and disabling zswap writeback, the former requires vswap while the
> latter doesn't.
Yup! Anyway, I think Youngjun sent out v8 - let's take a look.
^ permalink raw reply
* Re: [PATCH v3 15/15] mm/slab: replace __GFP_NO_OBJ_EXT with SLAB_ALLOC_NO_RECURSE for sheaves
From: Suren Baghdasaryan @ 2026-06-17 15:10 UTC (permalink / raw)
To: Harry Yoo
Cc: Vlastimil Babka (SUSE), Hao Li, Christoph Lameter, David Rientjes,
Roman Gushchin, Alexei Starovoitov, Andrew Morton,
Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
cgroups
In-Reply-To: <918fae64-1323-46ea-a86e-3c847a52f174@kernel.org>
On Wed, Jun 17, 2026 at 7:37 AM Harry Yoo <harry@kernel.org> wrote:
>
>
>
> On 6/15/26 8:54 PM, Vlastimil Babka (SUSE) wrote:
> > Finish the switch away from __GFP_NO_OBJ_EXT by replacing it with
> > SLAB_ALLOC_NO_RECURSE when allocating empty sheaves. Pass alloc_flags to
> > [__]alloc_empty_sheaf(). Callers that can't be part of a recursive
> > kmalloc() chain simply pass SLAB_ALLOC_DEFAULT. Use kmalloc_flags()
> > instead of kzalloc() for allocating the sheaf.
> >
> > With that we can finalize the removal of the __GFP_NO_OBJ_EXT handling from
> > obj_ext allocations as well, leaving only SLAB_ALLOC_NO_RECURSE in
> > place.
> >
> > This leaves __GFP_NO_OBJ_EXT with no users in slab, so stop allowing the
> > flag in kmalloc_nolock().
> >
> > Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-16-7190909db118@kernel.org
> > Reviewed-by: Hao Li <hao.li@linux.dev>
> > Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> > ---
>
> Looks good to me,
> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
> --
> Cheers,
> Harry / Hyeonggon
^ permalink raw reply
* Re: [PATCH v3 14/15] mm/slab: remove __GFP_NO_OBJ_EXT usage from alloc_slab_obj_exts()
From: Suren Baghdasaryan @ 2026-06-17 15:08 UTC (permalink / raw)
To: Vlastimil Babka (SUSE)
Cc: Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
Roman Gushchin, Alexei Starovoitov, Andrew Morton,
Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
cgroups
In-Reply-To: <1bf749a4-1519-4d14-a0a7-6d8a56a6c850@kernel.org>
On Wed, Jun 17, 2026 at 7:57 AM Vlastimil Babka (SUSE)
<vbabka@kernel.org> wrote:
>
> On 6/17/26 16:36, Vlastimil Babka (SUSE) wrote:
> >
> >> With some comments below.
> >>
> >> I was worried that perhaps replacing __GFP_NO_OBJ_EXT with
> >> SLAB_ALLOC_NO_RECURSE will create a cycle of
> >>
> >> alloc_slab_obj_exts(SLAB_ALLOC_DEFAULT)
> >> -> kmalloc_flags(SLAB_ALLOC_NO_RECURSE)
> >> -> alloc_from_pcs(SLAB_ALLOC_NO_RECURSE)
> >> -> refill_objects(SLAB_ALLOC_DEFAULT)
> >> -> new_slab(SLAB_ALLOC_DEFAULT)
> >> -> account_slab(SLAB_ALLOC_DEFAULT)
> >> -> alloc_slab_obj_exts(SLAB_ALLOC_DEFAULT)
> >>
> >> with __GFP_NO_OBJ_EXT, it would have been passed to refill_objects(),
> >> but SLAB_ALLOC_NO_RECURSE is not. However this cycle does not exist
> >> because alloc_slab_obj_exts() clears __GFP_ACCOUNT (as part of
> >> OBJCGS_CLEAR_MASK) and memory profiling itself does not invoke
> >> alloc_slab_obj_exts() when allocating new slabs if SLAB_ACCOUNT is not
> >> set (which is interesting, by the way).
> >
> > Hm yeah I think we should propagate alloc_flags to refill_objects() etc, to
> > avoid later surprise. But can be done as a later cleanup.
>
> It's also not a new hazard I think because while previously gfp flags with
> __GFP_NO_OBJ_EXT could be propagated more thoroughly than alloc_flags,
> for obj_exts only __alloc_tagging_slab_alloc_hook() looks at them, and
> alloc_slab_obj_exts() (from account_slab()) didn't either, so the amount of
> (finite) recursion is the same I think.
True but I think we should clean that up anyway after this change.
With the fixup you showed, LGTM.
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
> >> Also alloc_slab_obj_exts() propagating SLAB_ALLOC_NEW_SLAB to
> >> kmalloc_flags() is a little bit confusing because it does not have any
> >> effect due to SLAB_ALLOC_NO_RECURSE.
> >
> > OK let's address this one by this fixup:
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index fc5b8c85b690..dc4b4ae874ce 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2164,6 +2164,7 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> > {
> > const bool allow_spin = alloc_flags_allow_spinning(alloc_flags);
> > unsigned int objects = objs_per_slab(s, slab);
> > + bool new_slab = alloc_flags & SLAB_ALLOC_NEW_SLAB;
> > unsigned long new_exts;
> > unsigned long old_exts;
> > struct slabobj_ext *vec;
> > @@ -2173,6 +2174,7 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> > /* Prevent recursive extension vector allocation */
> > gfp |= __GFP_NO_OBJ_EXT;
> > alloc_flags |= SLAB_ALLOC_NO_RECURSE;
> > + alloc_flags &= ~SLAB_ALLOC_NEW_SLAB;
> >
> > sz = obj_exts_alloc_size(s, slab, gfp);
> >
> > @@ -2203,7 +2205,7 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> > old_exts = READ_ONCE(slab->obj_exts);
> > handle_failed_objexts_alloc(old_exts, vec, objects);
> >
> > - if (alloc_flags & SLAB_ALLOC_NEW_SLAB) {
> > + if (new_slab) {
> > /*
> > * If the slab is brand new and nobody can yet access its
> > * obj_exts, no synchronization is required and obj_exts can
> >
>
^ permalink raw reply
* Re: [PATCH v3 14/15] mm/slab: remove __GFP_NO_OBJ_EXT usage from alloc_slab_obj_exts()
From: Vlastimil Babka (SUSE) @ 2026-06-17 14:56 UTC (permalink / raw)
To: Harry Yoo
Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
cgroups
In-Reply-To: <26c29e4b-09b1-424a-b4e4-3358aac20115@kernel.org>
On 6/17/26 16:36, Vlastimil Babka (SUSE) wrote:
>
>> With some comments below.
>>
>> I was worried that perhaps replacing __GFP_NO_OBJ_EXT with
>> SLAB_ALLOC_NO_RECURSE will create a cycle of
>>
>> alloc_slab_obj_exts(SLAB_ALLOC_DEFAULT)
>> -> kmalloc_flags(SLAB_ALLOC_NO_RECURSE)
>> -> alloc_from_pcs(SLAB_ALLOC_NO_RECURSE)
>> -> refill_objects(SLAB_ALLOC_DEFAULT)
>> -> new_slab(SLAB_ALLOC_DEFAULT)
>> -> account_slab(SLAB_ALLOC_DEFAULT)
>> -> alloc_slab_obj_exts(SLAB_ALLOC_DEFAULT)
>>
>> with __GFP_NO_OBJ_EXT, it would have been passed to refill_objects(),
>> but SLAB_ALLOC_NO_RECURSE is not. However this cycle does not exist
>> because alloc_slab_obj_exts() clears __GFP_ACCOUNT (as part of
>> OBJCGS_CLEAR_MASK) and memory profiling itself does not invoke
>> alloc_slab_obj_exts() when allocating new slabs if SLAB_ACCOUNT is not
>> set (which is interesting, by the way).
>
> Hm yeah I think we should propagate alloc_flags to refill_objects() etc, to
> avoid later surprise. But can be done as a later cleanup.
It's also not a new hazard I think because while previously gfp flags with
__GFP_NO_OBJ_EXT could be propagated more thoroughly than alloc_flags,
for obj_exts only __alloc_tagging_slab_alloc_hook() looks at them, and
alloc_slab_obj_exts() (from account_slab()) didn't either, so the amount of
(finite) recursion is the same I think.
>> Also alloc_slab_obj_exts() propagating SLAB_ALLOC_NEW_SLAB to
>> kmalloc_flags() is a little bit confusing because it does not have any
>> effect due to SLAB_ALLOC_NO_RECURSE.
>
> OK let's address this one by this fixup:
>
> diff --git a/mm/slub.c b/mm/slub.c
> index fc5b8c85b690..dc4b4ae874ce 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2164,6 +2164,7 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> {
> const bool allow_spin = alloc_flags_allow_spinning(alloc_flags);
> unsigned int objects = objs_per_slab(s, slab);
> + bool new_slab = alloc_flags & SLAB_ALLOC_NEW_SLAB;
> unsigned long new_exts;
> unsigned long old_exts;
> struct slabobj_ext *vec;
> @@ -2173,6 +2174,7 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> /* Prevent recursive extension vector allocation */
> gfp |= __GFP_NO_OBJ_EXT;
> alloc_flags |= SLAB_ALLOC_NO_RECURSE;
> + alloc_flags &= ~SLAB_ALLOC_NEW_SLAB;
>
> sz = obj_exts_alloc_size(s, slab, gfp);
>
> @@ -2203,7 +2205,7 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> old_exts = READ_ONCE(slab->obj_exts);
> handle_failed_objexts_alloc(old_exts, vec, objects);
>
> - if (alloc_flags & SLAB_ALLOC_NEW_SLAB) {
> + if (new_slab) {
> /*
> * If the slab is brand new and nobody can yet access its
> * obj_exts, no synchronization is required and obj_exts can
>
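
The shape of the fixup (cache the bit in a local before clearing it from
the flags passed down) can be shown in isolation. This is a userspace
sketch only: the M_ bit values, struct exts_flags and prepare_flags()
are hypothetical and are not the definitions used in mm/slub.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for the sketch. */
#define M_SLAB_ALLOC_NEW_SLAB	(1u << 0)
#define M_SLAB_ALLOC_NO_RECURSE	(1u << 1)

struct exts_flags {
	unsigned int nested_flags;	/* flags the nested allocation sees */
	bool new_slab;			/* value the later branch tests */
};

static struct exts_flags prepare_flags(unsigned int alloc_flags)
{
	struct exts_flags r;

	/* Read the bit first, while it is still set in alloc_flags. */
	r.new_slab = alloc_flags & M_SLAB_ALLOC_NEW_SLAB;

	alloc_flags |= M_SLAB_ALLOC_NO_RECURSE;
	/* The nested allocation is not for a new slab, so drop the bit. */
	alloc_flags &= ~M_SLAB_ALLOC_NEW_SLAB;

	r.nested_flags = alloc_flags;
	assert(!(r.nested_flags & M_SLAB_ALLOC_NEW_SLAB));
	return r;
}
```

Testing the cached bool rather than the modified flags is what keeps the
later "brand new slab" branch working after the bit has been cleared.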
^ permalink raw reply
* Re: [PATCH v3 11/15] mm/slab: pass slab_alloc_context to __do_kmalloc_node()
From: Suren Baghdasaryan @ 2026-06-17 14:52 UTC (permalink / raw)
To: Harry Yoo
Cc: Vlastimil Babka (SUSE), Hao Li, Christoph Lameter, David Rientjes,
Roman Gushchin, Alexei Starovoitov, Andrew Morton,
Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
cgroups
In-Reply-To: <e499bab7-9217-4bac-848c-fb1472cd2c00@kernel.org>
On Wed, Jun 17, 2026 at 2:36 AM Harry Yoo <harry@kernel.org> wrote:
>
>
>
> On 6/15/26 8:54 PM, Vlastimil Babka (SUSE) wrote:
> > With alloc_flags usage in slab, we can replace __GFP_NO_OBJ_EXT with an
> > alloc flag that prevents kmalloc recursion. For that we need a version
> > of kmalloc() that takes alloc_flags and use it in places that perform
> > these potentially recursive kmalloc allocations (of sheaves or obj_ext
> > arrays).
> >
> > As a preparatory step, make __do_kmalloc_node() take a pointer to
> > slab_alloc_context. This replaces the 'size' and 'caller' parameters and
> > includes alloc_flags which we'll make use of.
> >
> > Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-12-7190909db118@kernel.org
> > Reviewed-by: Hao Li <hao.li@linux.dev>
> > Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> > ---
>
> Looks good to me,
> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
> --
> Cheers,
> Harry / Hyeonggon
^ permalink raw reply
* Re: [PATCH v3 08/15] mm/slab: pass alloc_flags through slab_post_alloc_hook() chain
From: Suren Baghdasaryan @ 2026-06-17 14:48 UTC (permalink / raw)
To: Harry Yoo
Cc: Vlastimil Babka (SUSE), Hao Li, Christoph Lameter, David Rientjes,
Roman Gushchin, Alexei Starovoitov, Andrew Morton,
Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
cgroups
In-Reply-To: <e9beb94a-6508-4bd0-b641-41e718990f7b@kernel.org>
On Tue, Jun 16, 2026 at 12:37 AM Harry Yoo <harry@kernel.org> wrote:
>
>
>
> On 6/15/26 8:54 PM, Vlastimil Babka (SUSE) wrote:
> > Convert the whole following call stack to pass either slab_alloc_context
> > (thus including alloc_flags) or just alloc_flags as necessary:
> >
> > slab_post_alloc_hook()
> > alloc_tagging_slab_alloc_hook()
> > __alloc_tagging_slab_alloc_hook()
> > prepare_slab_obj_exts_hook()
> > alloc_slab_obj_exts()
> > memcg_slab_post_alloc_hook()
> > __memcg_slab_post_alloc_hook()
> > alloc_slab_obj_exts()
> >
> > Converting all these at once avoids unnecessary churn and is mostly
> > mechanical.
> >
> > This ultimately allows deciding if spinning is allowed using
> > alloc_flags in alloc_slab_obj_exts(), as well as slab_post_alloc_hook().
> > Aside from alloc_from_pcs_bulk() (to be handled next) there is nothing
> > else in slab itself relying on gfpflags_allow_spinning() which can
> > be false even if not called from kmalloc_nolock().
> >
> > A followup change will also use the alloc_flags availability in the call
> > stack above to remove the __GFP_NO_OBJ_EXT flag.
> >
> > For alloc_slab_obj_exts(), also replace the suboptimal "bool new_slab"
> > parameter with a SLAB_ALLOC_NEW_SLAB flag with identical functionality.
> >
> > To further reduce the number of parameters of slab_post_alloc_hook(),
> > also make 'struct list_lru *lru' (which is NULL for most callers) a new
> > field of slab_alloc_context.
> >
> > Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-9-7190909db118@kernel.org
> > Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> > ---
>
> Looks good to me,
> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
> --
> Cheers,
> Harry / Hyeonggon
^ permalink raw reply
* Re: [PATCH v3 06/15] mm/slab: add alloc_flags to slab_alloc_context
From: Suren Baghdasaryan @ 2026-06-17 14:40 UTC (permalink / raw)
To: Vlastimil Babka (SUSE)
Cc: Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
Roman Gushchin, Alexei Starovoitov, Andrew Morton,
Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
cgroups
In-Reply-To: <20260615-slab_alloc_flags-v3-6-ce1146d140fb@kernel.org>
On Mon, Jun 15, 2026 at 4:55 AM Vlastimil Babka (SUSE)
<vbabka@kernel.org> wrote:
>
> Add alloc_flags as a new field to the slab_alloc_context helper struct,
> so we can pass it to more functions in the slab implementation without
> adding another function parameter.
>
> Start checking them via alloc_flags_allow_spinning() in
> alloc_single_from_new_slab() (where we can drop the allow_spin
> parameter), ___slab_alloc(), get_from_partial_node() and
> get_from_any_partial(). This further reduces false-positive
> spinning-not-allowed from allocations that are not kmalloc_nolock() but
> lack __GFP_RECLAIM flags.
>
> _kmalloc_nolock_noprof() initializes ac.alloc_flags using its flags that
> are SLAB_ALLOC_NOLOCK. slab_alloc_node() and __kmem_cache_alloc_bulk()
> are not reachable from kmalloc_nolock() and all their callers expect
> spinning to be allowed, so they can use SLAB_ALLOC_DEFAULT. This is
> temporary as the scope of slab_alloc_context will further move to the
> callers, making the alloc_flags usage more obvious.
>
> Also change how trynode_flags are constructed in ___slab_alloc() to
> achieve the same "do not upgrade to GFP_NOWAIT" by using masking instead
> of checking allow_spin. We need to do that because we now determine
> allow_spin from alloc_flags, and would otherwise start to upgrade e.g.
> kmalloc() allocations without __GFP_KSWAPD_RECLAIM (that however do
> allow spinning) to GFP_NOWAIT, thus including __GFP_KSWAPD_RECLAIM.
>
> During the masking keep also existing __GFP_NOMEMALLOC (pointed out by
> Sashiko) and __GFP_ACCOUNT. Previously the hardcoded GFP_NOWAIT would
> eliminate them, but it's not a big problem that would need a separate
> fix.
>
> Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-6-7190909db118@kernel.org
> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
> Reviewed-by: Hao Li <hao.li@linux.dev>
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> ---
> mm/slub.c | 28 +++++++++++++++-------------
> 1 file changed, 15 insertions(+), 13 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 6f6c15d796e1..3a34907b881b 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -217,6 +217,7 @@ static DEFINE_STATIC_KEY_FALSE(strict_numa);
> struct slab_alloc_context {
> unsigned long caller_addr;
> size_t orig_size;
> + unsigned int alloc_flags;
> };
>
> /* Structure holding parameters for get_partial_node_bulk() */
> @@ -3687,9 +3688,9 @@ static inline void init_slab_obj_iter(struct kmem_cache *s, struct slab *slab,
> * and put the slab to the partial (or full) list.
> */
> static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
> - const struct slab_alloc_context *ac,
> - bool allow_spin)
> + const struct slab_alloc_context *ac)
> {
> + bool allow_spin = alloc_flags_allow_spinning(ac->alloc_flags);
> struct kmem_cache_node *n;
> struct slab_obj_iter iter;
> bool needs_add_partial;
> @@ -3835,7 +3836,7 @@ static void *get_from_partial_node(struct kmem_cache *s,
> if (!n || !n->nr_partial)
> return NULL;
>
> - if (gfpflags_allow_spinning(gfp_flags))
> + if (alloc_flags_allow_spinning(ac->alloc_flags))
> spin_lock_irqsave(&n->list_lock, flags);
> else if (!spin_trylock_irqsave(&n->list_lock, flags))
> return NULL;
> @@ -3891,7 +3892,7 @@ static void *get_from_any_partial(struct kmem_cache *s, gfp_t gfp_flags,
> struct zone *zone;
> enum zone_type highest_zoneidx = gfp_zone(gfp_flags);
> unsigned int cpuset_mems_cookie;
> - bool allow_spin = gfpflags_allow_spinning(gfp_flags);
> + bool allow_spin = alloc_flags_allow_spinning(ac->alloc_flags);
>
> /*
> * The defrag ratio allows a configuration of the tradeoffs between
> @@ -4449,7 +4450,7 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> const struct slab_alloc_context *ac)
> {
> - bool allow_spin = gfpflags_allow_spinning(gfpflags);
> + bool allow_spin = alloc_flags_allow_spinning(ac->alloc_flags);
> gfp_t trynode_flags;
> void *object;
> struct slab *slab;
> @@ -4466,18 +4467,15 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> * 1) try to get a partial slab from target node only by having
> * __GFP_THISNODE in trynode_flags for get_from_partial()
> * 2) if 1) failed, try to allocate a new slab from target node with
> - * GPF_NOWAIT | __GFP_THISNODE opportunistically
> + * (at most) GFP_NOWAIT | __GFP_THISNODE opportunistically
> * 3) if 2) failed, retry with original gfpflags which will allow
> * get_from_partial() try partial lists of other nodes before
> * potentially allocating new page from other nodes
> */
> if (unlikely(node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE)
> && try_thisnode)) {
> - if (unlikely(!allow_spin))
> - /* Do not upgrade gfp to NOWAIT from more restrictive mode */
> - trynode_flags = gfpflags | __GFP_THISNODE;
> - else
> - trynode_flags = GFP_NOWAIT | __GFP_THISNODE;
> + trynode_flags &= GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_ACCOUNT;
> + trynode_flags |= __GFP_NOWARN | __GFP_THISNODE;
> }
>
> object = get_from_partial(s, node, trynode_flags, ac);
> @@ -4499,7 +4497,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> stat(s, ALLOC_SLAB);
>
> if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
> - object = alloc_single_from_new_slab(s, slab, ac, allow_spin);
> + object = alloc_single_from_new_slab(s, slab, ac);
>
> if (likely(object))
> goto success;
> @@ -4918,6 +4916,7 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, gfp_t gfp, size_t size,
> static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list_lru *lru,
> gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
> {
> + const unsigned int alloc_flags = SLAB_ALLOC_DEFAULT;
> void *object;
>
> s = slab_pre_alloc_hook(s, gfpflags);
> @@ -4928,12 +4927,13 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
> if (unlikely(object))
> goto out;
>
> - object = alloc_from_pcs(s, gfpflags, SLAB_ALLOC_DEFAULT, node);
> + object = alloc_from_pcs(s, gfpflags, alloc_flags, node);
>
> if (unlikely(!object)) {
> const struct slab_alloc_context ac = {
> .caller_addr = addr,
> .orig_size = orig_size,
> + .alloc_flags = alloc_flags,
> };
> object = __slab_alloc_node(s, gfpflags, node, &ac);
> }
> @@ -5366,6 +5366,7 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
> const struct slab_alloc_context ac = {
> .caller_addr = _RET_IP_,
> .orig_size = orig_size,
> + .alloc_flags = alloc_flags,
> };
>
> VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO |
> @@ -7254,6 +7255,7 @@ static bool __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
> const struct slab_alloc_context ac = {
> .caller_addr = _RET_IP_,
> .orig_size = s->object_size,
> + .alloc_flags = SLAB_ALLOC_DEFAULT,
> };
> for (i = 0; i < size; i++) {
>
>
> --
> 2.54.0
>
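
The "mask instead of upgrade" idea behind the trynode_flags change can be
checked with a small userspace model. The M_ bit values below are
hypothetical, and the sketch assumes that GFP_NOWAIT consists of the
kswapd-reclaim and no-warn bits; trynode_flags_model() is an invented
name, not a kernel function.

```c
#include <assert.h>

/* Hypothetical bit values, not the kernel's gfp bit layout. */
#define M_KSWAPD_RECLAIM	(1u << 0)
#define M_DIRECT_RECLAIM	(1u << 1)
#define M_NOWARN		(1u << 2)
#define M_THISNODE		(1u << 3)
#define M_NOMEMALLOC		(1u << 4)
#define M_ACCOUNT		(1u << 5)
/* Assumption: GFP_NOWAIT is kswapd reclaim plus no-warn. */
#define M_GFP_NOWAIT		(M_KSWAPD_RECLAIM | M_NOWARN)

static unsigned int trynode_flags_model(unsigned int gfpflags)
{
	unsigned int flags = gfpflags;

	/* Masking can only remove bits, so nothing is upgraded. */
	flags &= M_GFP_NOWAIT | M_NOMEMALLOC | M_ACCOUNT;
	flags |= M_NOWARN | M_THISNODE;

	assert(!(flags & M_DIRECT_RECLAIM));
	return flags;
}
```

A caller that did not pass the kswapd-reclaim bit still lacks it after
masking, whereas assigning GFP_NOWAIT outright would have added it.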
^ permalink raw reply
* Re: [PATCH v3 14/15] mm/slab: remove __GFP_NO_OBJ_EXT usage from alloc_slab_obj_exts()
From: Harry Yoo @ 2026-06-17 14:40 UTC (permalink / raw)
To: Vlastimil Babka (SUSE)
Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
cgroups
In-Reply-To: <26c29e4b-09b1-424a-b4e4-3358aac20115@kernel.org>
On 6/17/26 11:36 PM, Vlastimil Babka (SUSE) wrote:
> On 6/17/26 15:56, Harry Yoo wrote:
>> On 6/15/26 8:54 PM, Vlastimil Babka (SUSE) wrote:
>>> __GFP_NO_OBJ_EXT has limited scope within the slab allocator itself and
>>> gfp flags are a scarce resource, unlike slab's alloc_flags.
>>>
>>> Introduce SLAB_ALLOC_NO_RECURSE alloc flag that has the same intent as
>>> __GFP_NO_OBJ_EXT but a more generic name, meaning that a kmalloc()
>>> family function should not recurse into another kmalloc*() for the
>>> purposes of allocating auxiliary structures (obj_ext arrays or sheaves).
>>>
>>> First, replace the __GFP_NO_OBJ_EXT for allocating obj_ext arrays in
>>> alloc_slab_obj_exts(). Make use of the newly added kmalloc_flags()
>>> function, where we can pass alloc_flags with SLAB_ALLOC_NO_RECURSE
>>> added. This will also pass through SLAB_ALLOC_NOLOCK so we don't need
>>> to special case kmalloc_nolock() anymore.
>>>
>>> Note that until now the kmalloc_nolock() ignored the incoming gfp flags
>>> and hardcoded __GFP_ZERO | __GFP_NO_OBJ_EXT. But it's correct to pass on
>>> the incoming gfp flags (only augmented with __GFP_ZERO), because if
>>> alloc_flags contain SLAB_ALLOC_NOLOCK, the incoming gfp flags have to
>>> be also compatible with it. However, we might have added __GFP_THISNODE
>>> for opportunistic slab allocation, as pointed out by Hao Li, and
>>> __GFP_COMP by allocate_slab() as pointed out by Shengming Hu. Solve this
>>> by adding both flags to OBJCGS_CLEAR_MASK as it makes sense to strip
>>> them anyway for non-kmalloc_nolock() allocations of sheaves or obj_ext
>>> arrays as well.
>>>
>>> To avoid recursion of sheaf -> obj_ext -> sheaf -> ... allocations at
>>> this patch, until the next patch converts sheaves to
>>> SLAB_ALLOC_NO_RECURSE, use both gfp and alloc_flags for obj_ext. The
>>> next patch will remove the gfp part.
>>>
>>> Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-15-7190909db118@kernel.org
>>> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
>>> ---
>>
>> Looks good to me,
>> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
>
> Thanks!
>
>> With some comments below.
>>
>> I was worried that perhaps replacing __GFP_NO_OBJ_EXT with
>> SLAB_ALLOC_NO_RECURSE would create a cycle of
>>
>> alloc_slab_obj_exts(SLAB_ALLOC_DEFAULT)
>> -> kmalloc_flags(SLAB_ALLOC_NO_RECURSE)
>> -> alloc_from_pcs(SLAB_ALLOC_NO_RECURSE)
>> -> refill_objects(SLAB_ALLOC_DEFAULT)
>> -> new_slab(SLAB_ALLOC_DEFAULT)
>> -> account_slab(SLAB_ALLOC_DEFAULT)
>> -> alloc_slab_obj_exts(SLAB_ALLOC_DEFAULT)
>>
>> with __GFP_NO_OBJ_EXT, it would have been passed to refill_objects(),
>> but SLAB_ALLOC_NO_RECURSE is not. However this cycle does not exist
>> because alloc_slab_obj_exts() clears __GFP_ACCOUNT (as part of
>> OBJCGS_CLEAR_MASK) and memory profiling itself does not invoke
>> alloc_slab_obj_exts() when allocating new slabs if SLAB_ACCOUNT is not
>> set (which is interesting, by the way).
>
> Hm yeah I think we should propagate alloc_flags to refill_objects() etc, to
> avoid later surprise. But can be done as a later cleanup.
Ack.
>> Also alloc_slab_obj_exts() propagating SLAB_ALLOC_NEW_SLAB to
>> kmalloc_flags() is a little bit confusing because it does not have any
>> effect due to SLAB_ALLOC_NO_RECURSE.
>
> OK let's address this one by this fixup:
The fixup looks good to me, thanks!
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply
* Re: [PATCH v2 07/16] mm/slab: replace struct partial_context with slab_alloc_context
From: Suren Baghdasaryan @ 2026-06-17 14:39 UTC (permalink / raw)
To: Vlastimil Babka (SUSE)
Cc: Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
Roman Gushchin, Alexei Starovoitov, Andrew Morton,
Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
cgroups
In-Reply-To: <1c63fbca-6ee4-466f-bfb5-5ff25a847607@kernel.org>
On Mon, Jun 15, 2026 at 3:01 AM Vlastimil Babka (SUSE)
<vbabka@kernel.org> wrote:
>
> On 6/15/26 04:36, Suren Baghdasaryan wrote:
> > On Wed, Jun 10, 2026 at 11:05 PM Harry Yoo <harry@kernel.org> wrote:
> >>
> >>
> >>
> >> On 6/11/26 12:40 AM, Vlastimil Babka (SUSE) wrote:
> >> > Refactor get_from_partial_node(), get_from_any_partial(),
> >> > get_from_partial() and ___slab_alloc().
> >> >
> >> > Remove struct partial_context, which used to be more substantial but
> >> > shrank as part of the sheaves conversion. Instead pass gfp_flags and
> >> > pointer to the new slab_alloc_context, which together is a superset of
> >> > partial_context.
> >> >
> >> > This means alloc_flags are now available and we can use them to
> >> > determine if spinning is allowed, further reducing false positive "not
> >> > allowed" in the slow path due to gfp flags lacking __GFP_RECLAIM.
> >> >
> >> > Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> >> > ---
> >>
> >> Looks good to me,
> >> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
> >
> > Ah, nice! The conversion I was anticipating in the previous patch...
> > I would do this removal of partial_context as patch 6 and then convert
> > ___slab_alloc() and get_from_any_partial*() altogether in patch 7. I
> > think that would keep the behavior of the ___slab_alloc() more robust
> > throughout the patchset. But I would say it's nice to have, not a
> > must-have.
>
> OK, so I switched the order of 6 7 and all the changes from
> gfpflags_allow_spinning() to alloc_flags_allow_spinning are now in the
> newly-later patch; the "replace struct partial_context with
> slab_alloc_context" part has no functional changes. Verified that the end
> result is exactly the same, and only updated changelogs a bit.
Thanks for the refactoring. LGTM.
>
> > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
> Thanks!
>
> >>
> >> --
> >> Cheers,
> >> Harry / Hyeonggon
>
^ permalink raw reply
* Re: [PATCH v3 14/15] mm/slab: remove __GFP_NO_OBJ_EXT usage from alloc_slab_obj_exts()
From: Vlastimil Babka (SUSE) @ 2026-06-17 14:36 UTC (permalink / raw)
To: Harry Yoo
Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
cgroups
In-Reply-To: <78b67a9b-44e5-4649-957a-9d42bfaa098e@kernel.org>
On 6/17/26 15:56, Harry Yoo wrote:
>
>
> On 6/15/26 8:54 PM, Vlastimil Babka (SUSE) wrote:
>> __GFP_NO_OBJ_EXT has limited scope within the slab allocator itself and
>> gfp flags are a scarce resource, unlike slab's alloc_flags.
>>
>> Introduce SLAB_ALLOC_NO_RECURSE alloc flag that has the same intent as
>> __GFP_NO_OBJ_EXT but a more generic name, meaning that a kmalloc()
>> family function should not recurse into another kmalloc*() for the
>> purposes of allocating auxiliary structures (obj_ext arrays or sheaves).
>>
>> First, replace the __GFP_NO_OBJ_EXT for allocating obj_ext arrays in
>> alloc_slab_obj_exts(). Make use of the newly added kmalloc_flags()
>> function, where we can pass alloc_flags with SLAB_ALLOC_NO_RECURSE
>> added. This will also pass through SLAB_ALLOC_NOLOCK so we don't need
>> to special case kmalloc_nolock() anymore.
>>
>> Note that until now the kmalloc_nolock() ignored the incoming gfp flags
>> and hardcoded __GFP_ZERO | __GFP_NO_OBJ_EXT. But it's correct to pass on
>> the incoming gfp flags (only augmented with __GFP_ZERO), because if
>> alloc_flags contain SLAB_ALLOC_NOLOCK, the incoming gfp flags have to
>> be also compatible with it. However, we might have added __GFP_THISNODE
>> for opportunistic slab allocation, as pointed out by Hao Li, and
>> __GFP_COMP by allocate_slab() as pointed out by Shengming Hu. Solve this
>> by adding both flags to OBJCGS_CLEAR_MASK as it makes sense to strip
>> them anyway for non-kmalloc_nolock() allocations of sheaves or obj_ext
>> arrays as well.
>>
>> To avoid recursion of sheaf -> obj_ext -> sheaf -> ... allocations at
>> this patch, until the next patch converts sheaves to
>> SLAB_ALLOC_NO_RECURSE, use both gfp and alloc_flags for obj_ext. The
>> next patch will remove the gfp part.
>>
>> Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-15-7190909db118@kernel.org
>> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
>> ---
>
> Looks good to me,
> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Thanks!
> With some comments below.
>
> I was worried that perhaps replacing __GFP_NO_OBJ_EXT with
> SLAB_ALLOC_NO_RECURSE would create a cycle of
>
> alloc_slab_obj_exts(SLAB_ALLOC_DEFAULT)
> -> kmalloc_flags(SLAB_ALLOC_NO_RECURSE)
> -> alloc_from_pcs(SLAB_ALLOC_NO_RECURSE)
> -> refill_objects(SLAB_ALLOC_DEFAULT)
> -> new_slab(SLAB_ALLOC_DEFAULT)
> -> account_slab(SLAB_ALLOC_DEFAULT)
> -> alloc_slab_obj_exts(SLAB_ALLOC_DEFAULT)
>
> with __GFP_NO_OBJ_EXT, it would have been passed to refill_objects(),
> but SLAB_ALLOC_NO_RECURSE is not. However this cycle does not exist
> because alloc_slab_obj_exts() clears __GFP_ACCOUNT (as part of
> OBJCGS_CLEAR_MASK) and memory profiling itself does not invoke
> alloc_slab_obj_exts() when allocating new slabs if SLAB_ACCOUNT is not
> set (which is interesting, by the way).
Hm yeah I think we should propagate alloc_flags to refill_objects() etc, to
avoid later surprise. But can be done as a later cleanup.
> Also alloc_slab_obj_exts() propagating SLAB_ALLOC_NEW_SLAB to
> kmalloc_flags() is a little bit confusing because it does not have any
> effect due to SLAB_ALLOC_NO_RECURSE.
OK let's address this one by this fixup:
diff --git a/mm/slub.c b/mm/slub.c
index fc5b8c85b690..dc4b4ae874ce 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2164,6 +2164,7 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
{
const bool allow_spin = alloc_flags_allow_spinning(alloc_flags);
unsigned int objects = objs_per_slab(s, slab);
+ bool new_slab = alloc_flags & SLAB_ALLOC_NEW_SLAB;
unsigned long new_exts;
unsigned long old_exts;
struct slabobj_ext *vec;
@@ -2173,6 +2174,7 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
/* Prevent recursive extension vector allocation */
gfp |= __GFP_NO_OBJ_EXT;
alloc_flags |= SLAB_ALLOC_NO_RECURSE;
+ alloc_flags &= ~SLAB_ALLOC_NEW_SLAB;
sz = obj_exts_alloc_size(s, slab, gfp);
@@ -2203,7 +2205,7 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
old_exts = READ_ONCE(slab->obj_exts);
handle_failed_objexts_alloc(old_exts, vec, objects);
- if (alloc_flags & SLAB_ALLOC_NEW_SLAB) {
+ if (new_slab) {
/*
* If the slab is brand new and nobody can yet access its
* obj_exts, no synchronization is required and obj_exts can
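
As a standalone illustration of the pattern in the fixup above (not kernel code; the flag values and the function name are made up for this sketch): the "new slab" bit is latched into a local bool before it is stripped from the flags handed to the nested allocation, so a later check still sees what the caller originally asked for.

```c
#include <stdbool.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define SKETCH_ALLOC_NO_RECURSE (1u << 0)
#define SKETCH_ALLOC_NEW_SLAB   (1u << 1)

/*
 * Returns the flags that would be handed to the nested allocation and
 * reports through *was_new_slab what the caller originally requested.
 */
static unsigned int sketch_prepare_nested_flags(unsigned int alloc_flags,
						bool *was_new_slab)
{
	/* Latch the bit first; it is cleared from alloc_flags below. */
	*was_new_slab = alloc_flags & SKETCH_ALLOC_NEW_SLAB;

	/* The nested allocation must not recurse and is not a new slab. */
	alloc_flags |= SKETCH_ALLOC_NO_RECURSE;
	alloc_flags &= ~SKETCH_ALLOC_NEW_SLAB;

	return alloc_flags;
}
```

Any code that runs after the nested allocation would then test the latched bool rather than the modified flags.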
^ permalink raw reply related
* Re: [PATCH v3 15/15] mm/slab: replace __GFP_NO_OBJ_EXT with SLAB_ALLOC_NO_RECURSE for sheaves
From: Harry Yoo @ 2026-06-17 14:36 UTC (permalink / raw)
To: Vlastimil Babka (SUSE)
Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
cgroups
In-Reply-To: <20260615-slab_alloc_flags-v3-15-ce1146d140fb@kernel.org>
On 6/15/26 8:54 PM, Vlastimil Babka (SUSE) wrote:
> Finish the switch away from __GFP_NO_OBJ_EXT by replacing it with
> SLAB_ALLOC_NO_RECURSE when allocating empty sheaves. Pass alloc_flags to
> [__]alloc_empty_sheaf(). Callers that can't be part of a recursive
> kmalloc() chain simply pass SLAB_ALLOC_DEFAULT. Use kmalloc_flags()
> instead of kzalloc() for allocating the sheaf.
>
> With that we can finalize the removal of the __GFP_NO_OBJ_EXT handling from
> obj_ext allocations as well, leaving only SLAB_ALLOC_NO_RECURSE in
> place.
>
> This leaves __GFP_NO_OBJ_EXT with no users in slab, so stop allowing the
> flag in kmalloc_nolock().
>
> Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-16-7190909db118@kernel.org
> Reviewed-by: Hao Li <hao.li@linux.dev>
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
Looks good to me,
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply
* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-17 14:03 UTC (permalink / raw)
To: Balbir Singh
Cc: David Hildenbrand (Arm), lsf-pc, linux-kernel, linux-cxl, cgroups,
linux-mm, linux-trace-kernel, damon, kernel-team, gregkh, rafael,
dakr, dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman
In-Reply-To: <ajIb4DJdLGPbMB4V@parvat>
On Wed, Jun 17, 2026 at 02:02:47PM +1000, Balbir Singh wrote:
> On Wed, Jun 10, 2026 at 12:37:34PM -0400, Gregory Price wrote:
> > On Wed, Jun 10, 2026 at 05:00:33PM +0200, David Hildenbrand (Arm) wrote:
> > > On 6/10/26 12:41, Gregory Price wrote:
> > > > On Wed, Jun 03, 2026 at 03:00:01PM +1000, Balbir Singh wrote:
> > > >
> >
> > For mm/slub.c we can choose to do one of two things
> >
> > 1) 100% refuse slab allocations on private nodes, i.e.:
> >
> > kmalloc_node(..., private_nid, __GFP_THISNODE)
> >
> > And will fail (return NULL).
> >
>
> Doesn't this iterate through N_MEMORY only? N_MEMORY_PRIVATE should not
> be in the regular for_each(...) loops
>
If a node is in neither FALLBACK nor NOFALLBACK - it is *completely*
unreachable in the current page allocator.
In the next RFC I've reduced this to creating a ZONELIST_PRIVATE, separate
from ZONELIST_FALLBACK and ZONELIST_NOFALLBACK, plus an explicit folio
allocation interface that selects which fallback list to use.
The feedback in the past week has been helpful in homing in on a
solution that I think is generalizable. I have just been taking the time
to test various behaviors to make sure I haven't been regressing any
userland API/ABIs (mbind, mempolicy, etc).
~Gregory
^ permalink raw reply
* Re: [PATCH v3 14/15] mm/slab: remove __GFP_NO_OBJ_EXT usage from alloc_slab_obj_exts()
From: Harry Yoo @ 2026-06-17 13:56 UTC (permalink / raw)
To: Vlastimil Babka (SUSE)
Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
cgroups
In-Reply-To: <20260615-slab_alloc_flags-v3-14-ce1146d140fb@kernel.org>
On 6/15/26 8:54 PM, Vlastimil Babka (SUSE) wrote:
> __GFP_NO_OBJ_EXT has limited scope within the slab allocator itself and
> gfp flags are a scarce resource, unlike slab's alloc_flags.
>
> Introduce SLAB_ALLOC_NO_RECURSE alloc flag that has the same intent as
> __GFP_NO_OBJ_EXT but a more generic name, meaning that a kmalloc()
> family function should not recurse into another kmalloc*() for the
> purposes of allocating auxiliary structures (obj_ext arrays or sheaves).
>
> First, replace the __GFP_NO_OBJ_EXT for allocating obj_ext arrays in
> alloc_slab_obj_exts(). Make use of the newly added kmalloc_flags()
> function, where we can pass alloc_flags with SLAB_ALLOC_NO_RECURSE
> added. This will also pass through SLAB_ALLOC_NOLOCK so we don't need
> to special case kmalloc_nolock() anymore.
>
> Note that until now the kmalloc_nolock() ignored the incoming gfp flags
> and hardcoded __GFP_ZERO | __GFP_NO_OBJ_EXT. But it's correct to pass on
> the incoming gfp flags (only augmented with __GFP_ZERO), because if
> alloc_flags contain SLAB_ALLOC_NOLOCK, the incoming gfp flags have to
> be also compatible with it. However, we might have added __GFP_THISNODE
> for opportunistic slab allocation, as pointed out by Hao Li, and
> __GFP_COMP by allocate_slab() as pointed out by Shengming Hu. Solve this
> by adding both flags to OBJCGS_CLEAR_MASK as it makes sense to strip
> them anyway for non-kmalloc_nolock() allocations of sheaves or obj_ext
> arrays as well.
>
> To avoid recursion of sheaf -> obj_ext -> sheaf -> ... allocations at
> this patch, until the next patch converts sheaves to
> SLAB_ALLOC_NO_RECURSE, use both gfp and alloc_flags for obj_ext. The
> next patch will remove the gfp part.
>
> Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-15-7190909db118@kernel.org
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
Looks good to me,
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
With some comments below.
I was worried that perhaps replacing __GFP_NO_OBJ_EXT with
SLAB_ALLOC_NO_RECURSE would create a cycle of
alloc_slab_obj_exts(SLAB_ALLOC_DEFAULT)
-> kmalloc_flags(SLAB_ALLOC_NO_RECURSE)
-> alloc_from_pcs(SLAB_ALLOC_NO_RECURSE)
-> refill_objects(SLAB_ALLOC_DEFAULT)
-> new_slab(SLAB_ALLOC_DEFAULT)
-> account_slab(SLAB_ALLOC_DEFAULT)
-> alloc_slab_obj_exts(SLAB_ALLOC_DEFAULT)
with __GFP_NO_OBJ_EXT, it would have been passed to refill_objects(),
but SLAB_ALLOC_NO_RECURSE is not. However this cycle does not exist
because alloc_slab_obj_exts() clears __GFP_ACCOUNT (as part of
OBJCGS_CLEAR_MASK) and memory profiling itself does not invoke
alloc_slab_obj_exts() when allocating new slabs if SLAB_ACCOUNT is not
set (which is interesting, by the way).
Also alloc_slab_obj_exts() propagating SLAB_ALLOC_NEW_SLAB to
kmalloc_flags() is a little bit confusing because it does not have any
effect due to SLAB_ALLOC_NO_RECURSE.
Those are quite subtle and perhaps worth some attention.
But technically, should not be a blocker for the patch.
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply
* Re: [PATCH v7 0/8] selftests/cgroup: improve zswap tests robustness and support large page sizes
From: Michal Koutný @ 2026-06-17 12:28 UTC (permalink / raw)
To: Li Wang, tj
Cc: akpm, longman, roman.gushchin, hannes, yosry, jiayuan.chen,
nphamcs, chengming.zhou, shuah, linux-mm, cgroups,
linux-kselftest, linux-kernel
In-Reply-To: <20260424040059.12940-1-li.wang@linux.dev>
On Fri, Apr 24, 2026 at 12:00:51PM +0800, Li Wang <li.wang@linux.dev> wrote:
> This patchset aims to fix various spurious failures and improve the overall
> robustness of the cgroup zswap selftests.
>
> The primary motivation is to make the tests compatible with architectures
> that use non-4K page sizes (such as 64K on ppc64le and arm64). Currently,
> the tests rely heavily on hardcoded 4K page sizes and fixed memory limits.
> On 64K page size systems, these hardcoded values lead to sub-page granularity
> accesses, incorrect page count calculations, and insufficient memory pressure
> to trigger zswap writeback, ultimately causing the tests to fail.
>
> Additionally, this series addresses OOM kills occurring in test_swapin_nozswap
> by dynamically scaling memory limits, and prevents spurious test failures
> when zswap is built into the kernel but globally disabled.
>
> Changes in v7:
> Replace my work email by li.wang@linux.dev address.
> Add Acked-by: Nhat Pham <nphamcs@gmail.com> to series.
> Rebase to the latest branch (only one tiny conflict resolved).
I think the patches from the series where I had no special remarks can
be applied already (and the next, smaller series can be based on that).
Michal
^ permalink raw reply
* Re: [PATCH v7 6/8] selftest/cgroup: fix zswap test_no_invasive_cgroup_shrink on large pagesize system
From: Michal Koutný @ 2026-06-17 12:27 UTC (permalink / raw)
To: Li Wang
Cc: akpm, tj, longman, roman.gushchin, hannes, yosry, jiayuan.chen,
nphamcs, chengming.zhou, shuah, linux-mm, cgroups,
linux-kselftest, linux-kernel, Michal Hocko, Muchun Song,
Shakeel Butt, Yosry Ahmed
In-Reply-To: <20260424040059.12940-7-li.wang@linux.dev>
On Fri, Apr 24, 2026 at 12:00:57PM +0800, Li Wang <li.wang@linux.dev> wrote:
> test_no_invasive_cgroup_shrink sets up two cgroups: wb_group, which is
> expected to trigger zswap writeback, and a control group (renamed to
> zw_group),
Aha, it should stand for zswap writeback? Then zwb_group to avoid (my)
confusion with zsw_group :-)
Although the original names were already well descriptive (both groups
are expected to have some zswap).
> which should only have pages sitting in zswap without any
> writeback.
>
> There are two problems with the current test:
>
> 1) The data patterns are reversed. wb_group uses allocate_bytes(), which
> writes only a single byte per page — trivially compressible,
> especially by zstd — so compressed pages fit within zswap.max and
> writeback is never triggered. Meanwhile, the control group uses
> getrandom() to produce hard-to-compress data, but it is the group
> that does *not* need writeback.
>
> 2) The test uses fixed sizes (10K zswap.max, 10MB allocation) that are
> too small on systems with large PAGE_SIZE (e.g. 64K), failing to
> build enough memory pressure to trigger writeback reliably.
>
> Fix both issues by:
> - Swapping the data patterns: fill wb_group pages with partially
> random data (getrandom for page_size/4 bytes) to resist compression
> and trigger writeback, and fill zw_group pages with simple repeated
> data to stay compressed in zswap.
I'd have expected that having both equal (i.e. both random to fill up
more easily) is what tests the effect of zswap.max upon writeback most
precisely.
> - Making all size parameters PAGE_SIZE-aware: set allocation size to
> PAGE_SIZE * 1024, memory.zswap.max to PAGE_SIZE, and memory.max to
> allocation_size / 2 for both cgroups.
Makes sense.
> - Allocating memory inline instead of via cg_run() so the pages
> remain resident throughout the test.
What is the residency good for? (It doesn't matter AFAICS, so the change
seems gratuitous and code diverges from test_zswap_usage().)
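
For reference, the "partially random" fill described in the changelog (getrandom for page_size/4 bytes per page) could look roughly like the sketch below. This is not the code from the patch: the function name is made up, and zeroing the rest of each page is an assumption about what the compressible remainder contains.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/random.h>
#include <sys/types.h>

/*
 * For every full page in buf, write random data into the first
 * page_size / 4 bytes and leave the rest zeroed, so each page is only
 * partly compressible. Returns 0 on success, -1 on error.
 */
static int sketch_fill_partially_random(char *buf, size_t size, size_t page_size)
{
	size_t random_len = page_size / 4;

	memset(buf, 0, size);

	for (size_t off = 0; off + page_size <= size; off += page_size) {
		size_t done = 0;

		/* getrandom() may return fewer bytes than requested. */
		while (done < random_len) {
			ssize_t n = getrandom(buf + off + done,
					      random_len - done, 0);

			if (n < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			done += (size_t)n;
		}
	}

	return 0;
}
```

Using the same helper for both cgroups would give the "both equal" setup suggested above.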
^ permalink raw reply
* Re: [PATCH v7 4/8] selftests/cgroup: rename PAGE_SIZE to BUF_SIZE in cgroup_util
From: Michal Koutný @ 2026-06-17 12:26 UTC (permalink / raw)
To: Li Wang
Cc: akpm, tj, longman, roman.gushchin, hannes, yosry, jiayuan.chen,
nphamcs, chengming.zhou, shuah, linux-mm, cgroups,
linux-kselftest, linux-kernel, Michal Hocko, Muchun Song,
Shakeel Butt, Yosry Ahmed
In-Reply-To: <20260424040059.12940-5-li.wang@linux.dev>
On Fri, Apr 24, 2026 at 12:00:55PM +0800, Li Wang <li.wang@linux.dev> wrote:
> The cgroup utility code defines a local PAGE_SIZE macro hardcoded to
> 4096, which is used primarily as a generic buffer size for reading cgroup
> and proc files. This naming is misleading because the value has nothing
> to do with the actual page size of the system. On architectures with larger
> pages (e.g., 64K on arm64 or ppc64), the name suggests a relationship that
> does not exist. Additionally, the name can shadow or conflict with PAGE_SIZE
> definitions from system headers, leading to confusion or subtle bugs.
>
> To resolve this, rename the macro to BUF_SIZE to accurately reflect its
> purpose as a general I/O buffer size.
>
> Furthermore, test_memcontrol currently relies on this hardcoded 4K value
> to stride through memory and trigger page faults. Update this logic to
> use the actual system page size dynamically. This micro-optimizes the
> memory faulting process by ensuring it iterates correctly and efficiently
> based on the underlying architecture's true page size. (This part from Waiman)
>
> Signed-off-by: Li Wang <li.wang@linux.dev>
> Signed-off-by: Waiman Long <longman@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Michal Koutný <mkoutny@suse.com>
> Cc: Muchun Song <muchun.song@linux.dev>
> Cc: Nhat Pham <nphamcs@gmail.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Cc: Yosry Ahmed <yosryahmed@google.com>
> Acked-by: Nhat Pham <nphamcs@gmail.com>
> ---
> .../selftests/cgroup/lib/cgroup_util.c | 18 +++++++++---------
> .../cgroup/lib/include/cgroup_util.h | 4 ++--
> tools/testing/selftests/cgroup/test_core.c | 2 +-
> tools/testing/selftests/cgroup/test_freezer.c | 2 +-
> .../selftests/cgroup/test_memcontrol.c | 19 ++++++++++++-------
> 5 files changed, 25 insertions(+), 20 deletions(-)
>
> diff --git a/tools/testing/selftests/cgroup/lib/cgroup_util.c b/tools/testing/selftests/cgroup/lib/cgroup_util.c
> index 6a7295347e9..9be8ac25657 100644
> --- a/tools/testing/selftests/cgroup/lib/cgroup_util.c
> +++ b/tools/testing/selftests/cgroup/lib/cgroup_util.c
> @@ -140,7 +140,7 @@ int cg_read_strcmp_wait(const char *cgroup, const char *control,
>
> int cg_read_strstr(const char *cgroup, const char *control, const char *needle)
> {
> - char buf[PAGE_SIZE];
> + char buf[BUF_SIZE];
>
> if (cg_read(cgroup, control, buf, sizeof(buf)))
> return -1;
> @@ -170,7 +170,7 @@ long cg_read_long_fd(int fd)
>
> long cg_read_key_long(const char *cgroup, const char *control, const char *key)
> {
> - char buf[PAGE_SIZE];
> + char buf[BUF_SIZE];
> char *ptr;
>
> if (cg_read(cgroup, control, buf, sizeof(buf)))
> @@ -206,7 +206,7 @@ long cg_read_key_long_poll(const char *cgroup, const char *control,
>
> long cg_read_lc(const char *cgroup, const char *control)
> {
> - char buf[PAGE_SIZE];
> + char buf[BUF_SIZE];
> const char delim[] = "\n";
> char *line;
> long cnt = 0;
> @@ -258,7 +258,7 @@ int cg_write_numeric(const char *cgroup, const char *control, long value)
> static int cg_find_root(char *root, size_t len, const char *controller,
> bool *nsdelegate)
> {
> - char buf[10 * PAGE_SIZE];
> + char buf[10 * BUF_SIZE];
> char *fs, *mount, *type, *options;
> const char delim[] = "\n\t ";
>
> @@ -313,7 +313,7 @@ int cg_create(const char *cgroup)
>
> int cg_wait_for_proc_count(const char *cgroup, int count)
> {
> - char buf[10 * PAGE_SIZE] = {0};
> + char buf[10 * BUF_SIZE] = {0};
> int attempts;
> char *ptr;
>
> @@ -338,7 +338,7 @@ int cg_wait_for_proc_count(const char *cgroup, int count)
>
> int cg_killall(const char *cgroup)
> {
> - char buf[PAGE_SIZE];
> + char buf[BUF_SIZE];
> char *ptr = buf;
>
> /* If cgroup.kill exists use it. */
> @@ -548,7 +548,7 @@ int cg_run_nowait(const char *cgroup,
>
> int proc_mount_contains(const char *option)
> {
> - char buf[4 * PAGE_SIZE];
> + char buf[4 * BUF_SIZE];
> ssize_t read;
>
> read = read_text("/proc/mounts", buf, sizeof(buf));
> @@ -560,7 +560,7 @@ int proc_mount_contains(const char *option)
>
> int cgroup_feature(const char *feature)
> {
> - char buf[PAGE_SIZE];
> + char buf[BUF_SIZE];
> ssize_t read;
>
> read = read_text("/sys/kernel/cgroup/features", buf, sizeof(buf));
> @@ -587,7 +587,7 @@ ssize_t proc_read_text(int pid, bool thread, const char *item, char *buf, size_t
>
> int proc_read_strstr(int pid, bool thread, const char *item, const char *needle)
> {
> - char buf[PAGE_SIZE];
> + char buf[BUF_SIZE];
>
> if (proc_read_text(pid, thread, item, buf, sizeof(buf)) < 0)
> return -1;
> diff --git a/tools/testing/selftests/cgroup/lib/include/cgroup_util.h b/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
> index 567b1082974..febc1723d09 100644
> --- a/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
> +++ b/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
> @@ -2,8 +2,8 @@
> #include <stdbool.h>
> #include <stdlib.h>
>
> -#ifndef PAGE_SIZE
> -#define PAGE_SIZE 4096
> +#ifndef BUF_SIZE
> +#define BUF_SIZE 4096
> #endif
I wouldn't preserve any previously defined BUF_SIZE here (as opposed to
a possibly more conventional PAGE_SIZE value).
I.e.
-#ifndef PAGE_SIZE
-#define PAGE_SIZE 4096
-#endif
+#define BUF_SIZE 4096
But it's nothing substantial.
>
> #define MB(x) (x << 20)
> diff --git a/tools/testing/selftests/cgroup/test_core.c b/tools/testing/selftests/cgroup/test_core.c
> index 7b83c7e7c9d..88ca832d4fc 100644
> --- a/tools/testing/selftests/cgroup/test_core.c
> +++ b/tools/testing/selftests/cgroup/test_core.c
> @@ -87,7 +87,7 @@ static int test_cgcore_destroy(const char *root)
> int ret = KSFT_FAIL;
> char *cg_test = NULL;
> int child_pid;
> - char buf[PAGE_SIZE];
> + char buf[BUF_SIZE];
>
> cg_test = cg_name(root, "cg_test");
>
> diff --git a/tools/testing/selftests/cgroup/test_freezer.c b/tools/testing/selftests/cgroup/test_freezer.c
> index 97fae92c838..160a9e6ad27 100644
> --- a/tools/testing/selftests/cgroup/test_freezer.c
> +++ b/tools/testing/selftests/cgroup/test_freezer.c
> @@ -642,7 +642,7 @@ static int test_cgfreezer_ptrace(const char *root)
> */
> static int proc_check_stopped(int pid)
> {
> - char buf[PAGE_SIZE];
> + char buf[BUF_SIZE];
> int len;
>
> len = proc_read_text(pid, 0, "stat", buf, sizeof(buf));
> diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
> index b43da9bc20c..44338dbaee8 100644
> --- a/tools/testing/selftests/cgroup/test_memcontrol.c
> +++ b/tools/testing/selftests/cgroup/test_memcontrol.c
> @@ -26,6 +26,7 @@
>
> static bool has_localevents;
> static bool has_recursiveprot;
> +static int page_size;
>
> int get_temp_fd(void)
> {
> @@ -34,7 +35,7 @@ int get_temp_fd(void)
>
> int alloc_pagecache(int fd, size_t size)
> {
> - char buf[PAGE_SIZE];
> + char buf[BUF_SIZE];
This buffer is actually used as the stride, so keeping it page-sized is
more sensible.
> struct stat st;
> int i;
>
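
To illustrate the stride remark above, here is a sketch of a page-cache filler whose read buffer is sized from the real page size instead of a fixed macro. It is an assumption-laden stand-in, not the selftest's alloc_pagecache(): the function name, the heap-allocated buffer and the use of pread() are choices made only for this example.

```c
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Grow the file behind fd by 'size' bytes and read it back one page at
 * a time, so the read stride matches the system page size.
 * Returns 0 on success, -1 on error.
 */
static int sketch_alloc_pagecache(int fd, size_t size)
{
	long page_size = sysconf(_SC_PAGE_SIZE);
	struct stat st;
	char *buf;
	int ret = -1;

	if (page_size <= 0)
		return -1;

	buf = malloc((size_t)page_size);
	if (!buf)
		return -1;

	if (fstat(fd, &st))
		goto out;

	size += (size_t)st.st_size;
	if (ftruncate(fd, (off_t)size))
		goto out;

	/* One read per page; the buffer size is the stride. */
	for (size_t off = 0; off < size; off += (size_t)page_size) {
		if (pread(fd, buf, (size_t)page_size, (off_t)off) < 0)
			goto out;
	}
	ret = 0;
out:
	free(buf);
	return ret;
}
```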
^ permalink raw reply
* Re: [PATCH v7 5/8] selftests/cgroup: replace hardcoded page size values in test_zswap
From: Michal Koutný @ 2026-06-17 12:26 UTC (permalink / raw)
To: Li Wang
Cc: akpm, tj, longman, roman.gushchin, hannes, yosry, jiayuan.chen,
nphamcs, chengming.zhou, shuah, linux-mm, cgroups,
linux-kselftest, linux-kernel, Michal Hocko, Muchun Song,
Shakeel Butt
In-Reply-To: <20260424040059.12940-6-li.wang@linux.dev>
On Fri, Apr 24, 2026 at 12:00:56PM +0800, Li Wang <li.wang@linux.dev> wrote:
> @@ -752,6 +753,10 @@ int main(int argc, char **argv)
> char root[PATH_MAX];
> int i;
>
> + page_size = sysconf(_SC_PAGE_SIZE);
> + if (page_size <= 0)
> + page_size = BUF_SIZE;
> +
I'd just fail the whole test if this fails, so that page_size always
represents what it says. (When can this fail anyway? Maybe nommu archs?)
(Rest looks good.)
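
A minimal sketch of the "fail instead of falling back" idea, assuming a small helper (the name is hypothetical) that main() would call before running any test:

```c
#include <unistd.h>

/*
 * Returns the system page size, or -1 if sysconf() cannot report it.
 * There is deliberately no fallback to an unrelated constant.
 */
static long sketch_get_page_size(void)
{
	long sz = sysconf(_SC_PAGE_SIZE);

	return sz > 0 ? sz : -1;
}
```

A caller would treat -1 as a reason to bail out of the whole test run rather than continue with a guessed value.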
^ permalink raw reply
* Re: [PATCH v7 7/8] selftest/cgroup: fix zswap attempt_writeback() on 64K pagesize system
From: Michal Koutný @ 2026-06-17 12:26 UTC (permalink / raw)
To: Li Wang
Cc: akpm, tj, longman, roman.gushchin, hannes, yosry, jiayuan.chen,
nphamcs, chengming.zhou, shuah, linux-mm, cgroups,
linux-kselftest, linux-kernel, Michal Hocko, Muchun Song,
Shakeel Butt
In-Reply-To: <20260424040059.12940-8-li.wang@linux.dev>
On Fri, Apr 24, 2026 at 12:00:58PM +0800, Li Wang <li.wang@linux.dev> wrote:
> In attempt_writeback(), a memsize of 4M only covers 64 pages on 64K
> page size systems. When memory.reclaim is called, the kernel prefers
> reclaiming clean file pages (binary, libc, linker, etc.) over swapping
> anonymous pages. With only 64 pages of anonymous memory, the reclaim
> target can be largely or entirely satisfied by dropping file pages,
> resulting in very few or zero anonymous pages being pushed into zswap.
>
> This causes zswap_usage to be extremely small or zero, making
> zswap_usage/4 insufficient to create meaningful writeback pressure.
> The test then fails because no writeback is triggered.
>
> On 4K page size systems this is not an issue because 4M covers 1024
> pages, and file pages are a small fraction of the reclaim target.
>
> Fix this by:
> - Always allocating 1024 pages regardless of page size. This ensures
> enough anonymous pages to reliably populate zswap and trigger
> writeback, while keeping the original 4M allocation on 4K systems.
> - Setting zswap.max to zswap_usage/4 instead of zswap_usage/2 to
> create stronger writeback pressure, ensuring reclaim reliably
> triggers writeback even on large page size systems.
>
> === Error Log ===
> # uname -rm
> 6.12.0-211.el10.ppc64le ppc64le
>
> # getconf PAGESIZE
> 65536
>
> # ./test_zswap
> TAP version 13
> 1..7
> ok 1 test_zswap_usage
> ok 2 test_swapin_nozswap
> ok 3 test_zswapin
> not ok 4 test_zswap_writeback_enabled
> ...
>
> Signed-off-by: Li Wang <li.wang@linux.dev>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Michal Koutný <mkoutny@suse.com>
> Cc: Muchun Song <muchun.song@linux.dev>
> Cc: Nhat Pham <nphamcs@gmail.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Acked-by: Yosry Ahmed <yosry@kernel.org>
> Acked-by: Nhat Pham <nphamcs@gmail.com>
> ---
> tools/testing/selftests/cgroup/test_zswap.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
Reviewed-by: Michal Koutný <mkoutny@suse.com>
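
The arithmetic in the changelog can be checked with a small sketch (helper and macro names are made up for illustration): a fixed 4M allocation covers 1024 pages at 4K but only 64 pages at 64K, whereas allocating a fixed 1024 pages keeps 4M on 4K systems and grows to 64M on 64K systems.

```c
#include <stddef.h>

#define SKETCH_MB(x) ((size_t)(x) << 20)

/* Number of whole pages covered by an allocation of memsize bytes. */
static size_t sketch_pages_covered(size_t memsize, size_t page_size)
{
	return memsize / page_size;
}

/* Allocation size needed to cover nr_pages pages. */
static size_t sketch_memsize_for_pages(size_t nr_pages, size_t page_size)
{
	return nr_pages * page_size;
}
```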
^ permalink raw reply