Linux cgroups development


* Re: [PATCH v3 7/7] sched/eevdf: Move to a single runqueue
From: Chen, Yu C @ 2026-06-20  3:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef,
	mingo
In-Reply-To: <20260605124052.227463677@infradead.org>

On 6/5/2026 8:40 PM, Peter Zijlstra wrote:
> Change fair/cgroup to a single runqueue.
> 
> Infamously fair/cgroup isn't working for a number of people; typically
> the complaint is latencies and/or overhead. The latency issue is due
> to the intermediate entries that represent a combination of tasks and
> thereby obfuscate the runnability of tasks.
> 
> The approach here is to leave the cgroup hierarchy as is; including
> the intermediate enqueue/dequeue but move the actual EEVDF runqueue
> outside. This means things like the shares_weight approximation are
> fully preserved.
> 
> That is, given a hierarchy like:
> 
>            R
>            |
>            se--G1
>                / \
>          G2--se   se--G3
>         / \           |
>    T1--se se--T2      se--T3
> 
> This is fully maintained for load tracking, however the EEVDF parts of
> cfs_rq/se go unused for the intermediates and are instead connected
> like:
> 
>       _R_
>      / | \
>     T1 T2 T3
> 
> Since the effective weight of the entities is determined by the
> hierarchy, this gets recomputed on enqueue, set_next_task and tick.
> 
> Notably, the effective weight (se->h_load) is computed from the
> hierarchical fraction: se->load / cfs_rq->load.
> 
> Since EEVDF is now exclusively operating on rq->cfs, it needs to
> consider cfs_rq->h_nr_queued rather than cfs_rq->nr_queued. Similarly,
> only tasks can get delayed, simplifying some of the cgroup cleanup.
> 
> One place where additional information was required was
> set_next_task() / put_prev_task(), where we need to track 'current'
> both in the hierarchical sense (cfs_rq->h_curr) and in the flat sense
> (cfs_rq->curr).
> 
> As a result of only having a single level to pick from, much of the
> complications in pick_next_task() and preemption go away.
> 
> Since many of the hierarchical operations are still there, this won't
> immediately fix the performance issues, but hopefully it will fix some
> of the latency issues.
> 
> TODO: split struct cfs_rq / struct sched_entity
> TODO: try and get rid of h_curr
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

A divide-by-zero crash is observed when running hackbench:

   [14697.488452] CPU: 112 UID: 0 PID: 124791 Comm: hackbench Not tainted 7.1.0-rc2+
   [14697.492627] RIP: 0010:propagate_entity_load_avg+0x35f/0x3e0
   [14697.506799]  <TASK>
   [14697.507411]  __dequeue_task+0x2b4/0xc70
   [14697.508677]  dequeue_task_fair+0x36/0x370
   [14697.509047]  dequeue_task+0x101/0x2f0
   [14697.509426]  __schedule+0x1b1/0x1a00
   [14697.510868]  anon_pipe_read+0x3da/0x450
   [14697.511400]  vfs_read+0x361/0x390
   [14697.512053]  __x64_sys_read+0x19/0x30

The divide-by-zero happens here:

if (scale_load_down(gcfs_rq->load.weight)) {
         load_sum = div_u64(gcfs_rq->avg.load_sum,
                 scale_load_down(gcfs_rq->load.weight));
}

gcfs_rq->load.weight is an insanely large value, and div_u64() truncates
it to its lower 32 bits, which happen to be 0.

An AI-assisted investigation found the cause to be a u32 overflow in
update_tg_cfs_runnable(); the flat pick became a victim through its use
of tg_tasks():

   u32 new_sum, divider;
   ...
   new_sum = se->avg.runnable_avg * divider; <-- boom

The following sequence shows how this triggers the crash:

   propagate_entity_load_avg()
     update_tg_cfs_runnable()     # u32 overflow corrupts runnable_sum

   __update_load_avg_cfs_rq()
     ___update_load_avg()         # computes insane runnable_avg
   update_tg_load_avg()           # propagates to tg->runnable_avg

   update_cfs_group()
     calc_concur_shares()
       tg_tasks()                 # long-to-int truncation, negative nr
     reweight_entity()            # corrupted se->load.weight
       update_load_add()          # corrupted cfs_rq->load.weight

   propagate_entity_load_avg()
     update_tg_cfs_load()
       div_u64()                  # divide-by-zero

Fix by widening new_sum from u32 to u64 (there is no need to force
tg_tasks() to return unsigned long after this fix).

Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
  kernel/sched/fair.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d991ea85873a..99ea51448981 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5305,7 +5305,8 @@ static inline void
  update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
  {
  	long delta_sum, delta_avg = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg;
-	u32 new_sum, divider;
+	u64 new_sum;
+	u32 divider;

  	/* Nothing to update */
  	if (!delta_avg)
@@ -5319,7 +5320,7 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf

  	/* Set new sched_entity's runnable */
  	se->avg.runnable_avg = gcfs_rq->avg.runnable_avg;
-	new_sum = se->avg.runnable_avg * divider;
+	new_sum = (u64)se->avg.runnable_avg * divider;
  	delta_sum = (long)new_sum - (long)se->avg.runnable_sum;
  	se->avg.runnable_sum = new_sum;

-- 
2.45.2


* [PATCH] selftests/cgroup: Adjust cpu.max quota based on HZ
From: Joe Simmons-Talbott @ 2026-06-19 21:18 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Shuah Khan
  Cc: Joe Simmons-Talbott, cgroups, linux-kselftest, linux-kernel

For lower HZ values, a quota of 1000us is much lower than the number
of microseconds per tick, which makes test_cpucg_max and
test_cpucg_max_nested fail. Use the number of microseconds per tick
as the quota value.

Signed-off-by: Joe Simmons-Talbott <joest@redhat.com>
---
 tools/testing/selftests/cgroup/test_cpu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/cgroup/test_cpu.c b/tools/testing/selftests/cgroup/test_cpu.c
index 7a40d76b9548..4ac5d3ecae00 100644
--- a/tools/testing/selftests/cgroup/test_cpu.c
+++ b/tools/testing/selftests/cgroup/test_cpu.c
@@ -646,7 +646,7 @@ test_cpucg_nested_weight_underprovisioned(const char *root)
 static int test_cpucg_max(const char *root)
 {
 	int ret = KSFT_FAIL;
-	long quota_usec = 1000;
+	long quota_usec = USEC_PER_SEC / sysconf(_SC_CLK_TCK);
 	long default_period_usec = 100000; /* cpu.max's default period */
 	long duration_seconds = 1;
 
@@ -710,7 +710,7 @@ static int test_cpucg_max(const char *root)
 static int test_cpucg_max_nested(const char *root)
 {
 	int ret = KSFT_FAIL;
-	long quota_usec = 1000;
+	long quota_usec = USEC_PER_SEC / sysconf(_SC_CLK_TCK);
 	long default_period_usec = 100000; /* cpu.max's default period */
 	long duration_seconds = 1;
 
-- 
2.54.0



* [REGRESSION] [PATCH v2 1/2] mm: vmalloc: streamline vmalloc memory accounting
From: Aishwarya Rambhadran @ 2026-06-19 12:53 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: Uladzislau Rezki, Joshua Hahn, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, linux-mm, cgroups, linux-kernel,
	Ryan Roberts
In-Reply-To: <20260223160147.3792777-1-hannes@cmpxchg.org>

Hi Johannes,

We have observed kernel performance regressions in vmalloc benchmarks
when comparing v7.0 mainline results against later releases in the v7.1
cycle.
The regressions were detected by Fastpath, our automated kernel
performance benchmark and regression tracking framework.
Independent bisections on multiple arm64 systems consistently
identify this patch as the root cause. The regressions are reproducible
on both AWS Graviton3 & AmpereOne systems.

Fastpath bisection details:
Benchmark - micromm/vmalloc
Test - fix_size_alloc_test: p:512, h:1, l:100000
Good Kernel - v7.0
Bad Kernel - v7.1-rc4

The measured regression for the above test is approximately 32.5%
on AWS Graviton3. Similar regressions are observed across multiple
tests within the vmalloc benchmark suite as well as on AmpereOne.

Below are the vmalloc suite benchmark results generated by the Fastpath
tool for the v7.1 kernel relative to the v7.0 base, executed on the AWS
Graviton3 SUT. The label (R) marks a statistically significant
regression, where "statistically significant" means the 95% confidence
intervals do not overlap.

v7.0 (base) | v7.1
-------------------------------------------------------------------
fix_align_alloc_test: p:1, h:0, l:500000
895106.67 | (R) -10.73%

fix_size_alloc_test: p:1, h:0, l:500000
336785.00 | (R) -7.31%

fix_size_alloc_test: p:4, h:0, l:500000
529652.83 | (R) -13.11%

fix_size_alloc_test: p:16, h:0, l:500000
1043412.50 | (R) -21.92%

fix_size_alloc_test: p:16, h:1, l:500000
1015795.83 | (R) -22.02%

fix_size_alloc_test: p:64, h:0, l:100000
643074.33 | (R) -25.91%

fix_size_alloc_test: p:64, h:1, l:100000
607604.00 | (R) -27.31%

fix_size_alloc_test: p:256, h:0, l:100000
2367906.50 | (R) -27.67%

fix_size_alloc_test: p:256, h:1, l:100000
2275464.67 | (R) -28.66%

fix_size_alloc_test: p:512, h:0, l:100000
4696069.17 | (R) -28.15%

fix_size_alloc_test: p:512, h:1, l:100000
3767292.00 | (R) -32.65%

full_fit_alloc_test: p:1, h:0, l:500000
493884.17 | (R) -12.38%

kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000
354542.83 | -2.31%

kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000
358082.83 | -1.53%

long_busy_list_alloc_test: p:1, h:0, l:500000
5490101.33 | (R) -25.85%

pcpu_alloc_test: p:1, h:0, l:500000
193634.00 | -1.53%

random_size_align_alloc_test: p:1, h:0, l:500000
1200206.83 | (R) -11.88%

random_size_alloc_test: p:1, h:0, l:500000
2875736.33 | (R) -24.41%

vm_map_ram_test: p:1, h:0, l:500000
81204.33 | -0.28%
-------------------------------------------------------------------

The regression signal appears stable across repeated runs.
Have you seen similar effects before, or is there an expected
behavioral change associated with the conversion from the
custom atomic accounting to vmstat counters that could
explain this result?

We would be happy to provide additional performance data,
kernel configurations or any other details if useful.

Thank you.
Aishwarya Rambhadran

On 23/02/26 9:31 PM, Johannes Weiner wrote:
> Use a vmstat counter instead of a custom, open-coded atomic. This has
> the added benefit of making the data available per-node, and prepares
> for cleaning up the memcg accounting as well.
>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>   fs/proc/meminfo.c       |  3 ++-
>   include/linux/mmzone.h  |  1 +
>   include/linux/vmalloc.h |  3 ---
>   mm/vmalloc.c            | 19 ++++++++++---------
>   mm/vmstat.c             |  1 +
>   5 files changed, 14 insertions(+), 13 deletions(-)
>
> V2:
> - Fix mod_node_page_state() pgdat argument (Shakeel)
>
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index a458f1e112fd..549793f44726 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -126,7 +126,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>   	show_val_kb(m, "Committed_AS:   ", committed);
>   	seq_printf(m, "VmallocTotal:   %8lu kB\n",
>   		   (unsigned long)VMALLOC_TOTAL >> 10);
> -	show_val_kb(m, "VmallocUsed:    ", vmalloc_nr_pages());
> +	show_val_kb(m, "VmallocUsed:    ",
> +		    global_node_page_state(NR_VMALLOC));
>   	show_val_kb(m, "VmallocChunk:   ", 0ul);
>   	show_val_kb(m, "Percpu:         ", pcpu_nr_pages());
>   
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index fc5d6c88d2f0..64df797d45c6 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -220,6 +220,7 @@ enum node_stat_item {
>   	NR_KERNEL_MISC_RECLAIMABLE,	/* reclaimable non-slab kernel pages */
>   	NR_FOLL_PIN_ACQUIRED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
>   	NR_FOLL_PIN_RELEASED,	/* pages returned via unpin_user_page() */
> +	NR_VMALLOC,
>   	NR_KERNEL_STACK_KB,	/* measured in KiB */
>   #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
>   	NR_KERNEL_SCS_KB,	/* measured in KiB */
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index e8e94f90d686..3b02c0c6b371 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -286,8 +286,6 @@ int unregister_vmap_purge_notifier(struct notifier_block *nb);
>   #ifdef CONFIG_MMU
>   #define VMALLOC_TOTAL (VMALLOC_END - VMALLOC_START)
>   
> -unsigned long vmalloc_nr_pages(void);
> -
>   int vm_area_map_pages(struct vm_struct *area, unsigned long start,
>   		      unsigned long end, struct page **pages);
>   void vm_area_unmap_pages(struct vm_struct *area, unsigned long start,
> @@ -304,7 +302,6 @@ static inline void set_vm_flush_reset_perms(void *addr)
>   #else  /* !CONFIG_MMU */
>   #define VMALLOC_TOTAL 0UL
>   
> -static inline unsigned long vmalloc_nr_pages(void) { return 0; }
>   static inline void set_vm_flush_reset_perms(void *addr) {}
>   #endif /* CONFIG_MMU */
>   
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index e286c2d2068c..a5fc7795aafd 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1063,14 +1063,8 @@ static BLOCKING_NOTIFIER_HEAD(vmap_notify_list);
>   static void drain_vmap_area_work(struct work_struct *work);
>   static DECLARE_WORK(drain_vmap_work, drain_vmap_area_work);
>   
> -static __cacheline_aligned_in_smp atomic_long_t nr_vmalloc_pages;
>   static __cacheline_aligned_in_smp atomic_long_t vmap_lazy_nr;
>   
> -unsigned long vmalloc_nr_pages(void)
> -{
> -	return atomic_long_read(&nr_vmalloc_pages);
> -}
> -
>   static struct vmap_area *__find_vmap_area(unsigned long addr, struct rb_root *root)
>   {
>   	struct rb_node *n = root->rb_node;
> @@ -3463,11 +3457,11 @@ void vfree(const void *addr)
>   		 * High-order allocs for huge vmallocs are split, so
>   		 * can be freed as an array of order-0 allocations
>   		 */
> +		if (!(vm->flags & VM_MAP_PUT_PAGES))
> +			dec_node_page_state(page, NR_VMALLOC);
>   		__free_page(page);
>   		cond_resched();
>   	}
> -	if (!(vm->flags & VM_MAP_PUT_PAGES))
> -		atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
>   	kvfree(vm->pages);
>   	kfree(vm);
>   }
> @@ -3655,6 +3649,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>   			continue;
>   		}
>   
> +		mod_node_page_state(page_pgdat(page), NR_VMALLOC, 1 << large_order);
> +
>   		split_page(page, large_order);
>   		for (i = 0; i < (1U << large_order); i++)
>   			pages[nr_allocated + i] = page + i;
> @@ -3675,6 +3671,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>   	if (!order) {
>   		while (nr_allocated < nr_pages) {
>   			unsigned int nr, nr_pages_request;
> +			int i;
>   
>   			/*
>   			 * A maximum allowed request is hard-coded and is 100
> @@ -3698,6 +3695,9 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>   							nr_pages_request,
>   							pages + nr_allocated);
>   
> +			for (i = nr_allocated; i < nr_allocated + nr; i++)
> +				inc_node_page_state(pages[i], NR_VMALLOC);
> +
>   			nr_allocated += nr;
>   
>   			/*
> @@ -3722,6 +3722,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>   		if (unlikely(!page))
>   			break;
>   
> +		mod_node_page_state(page_pgdat(page), NR_VMALLOC, 1 << order);
> +
>   		/*
>   		 * High-order allocations must be able to be treated as
>   		 * independent small pages by callers (as they can with
> @@ -3864,7 +3866,6 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>   			vmalloc_gfp_adjust(gfp_mask, page_order), node,
>   			page_order, nr_small_pages, area->pages);
>   
> -	atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
>   	/* All pages of vm should be charged to same memcg, so use first one. */
>   	if (gfp_mask & __GFP_ACCOUNT && area->nr_pages)
>   		mod_memcg_page_state(area->pages[0], MEMCG_VMALLOC,
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index d6e814c82952..bc199c7cd07b 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1270,6 +1270,7 @@ const char * const vmstat_text[] = {
>   	[I(NR_KERNEL_MISC_RECLAIMABLE)]		= "nr_kernel_misc_reclaimable",
>   	[I(NR_FOLL_PIN_ACQUIRED)]		= "nr_foll_pin_acquired",
>   	[I(NR_FOLL_PIN_RELEASED)]		= "nr_foll_pin_released",
> +	[I(NR_VMALLOC)]				= "nr_vmalloc",
>   	[I(NR_KERNEL_STACK_KB)]			= "nr_kernel_stack",
>   #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
>   	[I(NR_KERNEL_SCS_KB)]			= "nr_shadow_call_stack",



* Re: [PATCH v3 08/13] genirq: Add explicit housekeeping callback for managed IRQ migration
From: Thomas Gleixner @ 2026-06-18 21:11 UTC (permalink / raw)
  To: Jing Wu, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Waiman Long
  Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
	Qiliang Yuan
In-Reply-To: <87cxxnegqa.ffs@fw13>

On Thu, Jun 18 2026 at 22:27, Thomas Gleixner wrote:
> On Thu, Jun 18 2026 at 11:11, Jing Wu wrote:
>> +		 */
>> +		if (irqd_affinity_is_managed(&desc->irq_data)) {
>
> So you set the affinity even on an interrupt which is shutdown?
>
>> +			const struct cpumask *mask;
>> +			struct cpumask *tmp = this_cpu_ptr(&__tmp_mask);

How is this correct? You cannot get the per cpu pointer in preemptible
context. The task might be migrated and then fiddle with the wrong
per CPU data. But that's moot as this code is broken anyway.




* Re: [PATCH v3 05/13] cpu/hotplug: Reserve CPUHP states for nohz_full and managed IRQ down-paths
From: Thomas Gleixner @ 2026-06-18 21:01 UTC (permalink / raw)
  To: Jing Wu, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
	Shuah Khan
  Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
	Qiliang Yuan
In-Reply-To: <871pe3de9b.ffs@fw13>

On Thu, Jun 18 2026 at 18:06, Thomas Gleixner wrote:
> On Thu, Jun 18 2026 at 11:11, Jing Wu wrote:
>> Add CPUHP_AP_NO_HZ_FULL_DYING and CPUHP_AP_IRQ_AFFINITY_DYING to the
>> cpuhp_state enum.  These dying callbacks are invoked during CPU offline
>> before the tick is stopped, enabling clean tick handover and managed
>> IRQ migration when a CPU transitions between isolated and housekeeping
>> states.
>>
>> The existing CPUHP_AP_IRQ_AFFINITY_ONLINE already handles managed IRQ
>> restoration on CPU online.  The new dying callback completes the pair,
>> migrating managed interrupts away from the CPU before it goes down.
>
> What? They are migrated away today already when the CPU goes down unless
> the CPU is the last one in the affinity set of the interrupt. So why do
> you need a new step for something which already exists?

Aside from that, these hotplug states are not used at all. So what is
this patch for?



* Re: [PATCH v3 11/13] cgroup/cpuset: Extend isolated partition to trigger kernel-noise isolation
From: Thomas Gleixner @ 2026-06-18 20:55 UTC (permalink / raw)
  To: Jing Wu, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
	Shuah Khan
  Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
	Qiliang Yuan
In-Reply-To: <20260618-wujing-dhm-v3-11-28f1a4d83b68@gmail.com>

On Thu, Jun 18 2026 at 11:11, Jing Wu wrote:
>  
>  	if (update_housekeeping) {
> +		static const unsigned long noise_types =
> +			BIT(HK_TYPE_KERNEL_NOISE) | BIT(HK_TYPE_MANAGED_IRQ);
> +
>  		update_housekeeping = false;
>  		cpumask_copy(isolated_hk_cpus, isolated_cpus);
>  
> -		/*
> -		 * housekeeping_update() is now called without holding
> -		 * cpus_read_lock and cpuset_mutex. Only cpuset_top_mutex
> -		 * is still being held for mutual exclusion.
> -		 */

Why are you randomly removing useful comments?


* Re: [PATCH v3 10/13] sched: Guard sched_tick_start/stop against uninitialized tick_work_cpu
From: Thomas Gleixner @ 2026-06-18 20:50 UTC (permalink / raw)
  To: Jing Wu, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
	Shuah Khan
  Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
	Qiliang Yuan
In-Reply-To: <20260618-wujing-dhm-v3-10-28f1a4d83b68@gmail.com>

On Thu, Jun 18 2026 at 11:11, Jing Wu wrote:
> sched_tick_start() and sched_tick_stop() are called during CPU hotplug
> for CPUs not in the HK_TYPE_KERNEL_NOISE set.  They dereference
> tick_work_cpu, which is allocated by sched_tick_offload_init() and only
> called from housekeeping_init() when nohz_full= is present at boot.
>
> When the DHM subsystem first-enables HK_TYPE_KERNEL_NOISE at runtime via
> housekeeping_update_types(), tick_work_cpu remains NULL because
> sched_tick_offload_init() is __init-only and cannot be re-invoked.  A
> subsequent CPU offline/online cycle for an isolated CPU triggers
> WARN_ON_ONCE(!tick_work_cpu) followed by a NULL-pointer dereference in
> per_cpu_ptr(tick_work_cpu, cpu), crashing the kernel.
>
> Since nohz_full= was not active at boot, tick_nohz_full_running remains
> false and the tick-offload infrastructure is never activated; isolated
> CPUs continue to receive their own ticks.  Guard both helpers with an
> additional !tick_work_cpu check so they become no-ops in this case.

This is the same fake functionality as with the tick itself. Seriously?

> -	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
> +	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) || !tick_work_cpu)
>  		return;
>  
>  	WARN_ON_ONCE(!tick_work_cpu);
> @@ -5799,7 +5799,7 @@ static void sched_tick_stop(int cpu)
>  	struct tick_work *twork;
>  	int os;
>  
> -	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
> +	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) || !tick_work_cpu)
>  		return;
>  
>  	WARN_ON_ONCE(!tick_work_cpu);

Brilliant stuff that. Guard against tick_work_cpu == NULL and then keep
the WARN_ON() there, which became completely pointless.

But that's all just mindless tinkering and fixing the symptoms.

If all of this is runtime managed, then all the initialization needs to
be made unconditional. Yes, that wastes a few bytes of memory per CPU if
it's not used, but avoids these completely inconsistent hacks all over
the place and provides a coherent user interface.

Stop trying to duct tape this in. This needs more thought than just
sprinkling a few works-for-me hacks all over the place.

Thanks,

        tglx


* Re: [PATCH v3 08/13] genirq: Add explicit housekeeping callback for managed IRQ migration
From: Thomas Gleixner @ 2026-06-18 20:27 UTC (permalink / raw)
  To: Jing Wu, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Waiman Long
  Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
	Qiliang Yuan
In-Reply-To: <20260618-wujing-dhm-v3-8-28f1a4d83b68@gmail.com>

On Thu, Jun 18 2026 at 11:11, Jing Wu wrote:
> +
> +/*
> + * Managed IRQ housekeeping callback: iterate all managed IRQs and ask

S/IRQ/interrupt/ 

> + * the chip to move them off CPUs newly removed from HK_TYPE_MANAGED_IRQ.

Also this doesn't ask the chip to move it.

> + */
> +static void irq_hk_apply(enum hk_type type)
> +{
> +	cpumask_var_t hk_mask;
> +	struct irq_desc *desc;
> +	unsigned int irq;
> +
> +	if (!alloc_cpumask_var(&hk_mask, GFP_KERNEL))
> +		return;
> +
> +	/*
> +	 * Snapshot the new HK_TYPE_MANAGED_IRQ mask under an RCU read lock
> +	 * before iterating IRQ descriptors.  The lockdep annotation in
> +	 * housekeeping_cpumask() requires an RCU read-side critical section
> +	 * for runtime-mutable types.
> +	 */
> +	rcu_read_lock();
> +	cpumask_copy(hk_mask, housekeeping_cpumask_rcu(HK_TYPE_MANAGED_IRQ));
> +	rcu_read_unlock();

Same comments as in the nohz patch.

> +
> +	irq_lock_sparse();
> +
> +	for_each_active_irq(irq) {
> +		desc = irq_to_desc(irq);
> +		if (!desc || !desc->action)
> +			continue;
> +

	for (unsigned int irq = 0; irq < total_nr_irqs; irq++) {
                struct irq_desc *desc;

                 scoped_guard(rcu)
                 	desc = irq_find_desc_at_or_after(irq);
                 ....

> +		/*
> +		 * Only managed interrupts are selected: they have
> +		 * IRQF_AFFINITY_MANAGED set, meaning the kernel owns their
> +		 * affinity.  User-controlled IRQs are intentionally skipped.
> +		 *
> +		 * When the intersection of the current affinity mask and the
> +		 * new housekeeping mask is non-empty, re-apply the restricted
> +		 * affinity to migrate the IRQ away from newly isolated CPUs.
> +		 * If the intersection is empty (all serving CPUs are now
> +		 * isolated), the IRQ is left on its current CPU temporarily;
> +		 * handling that case (IRQ shutdown / re-startup) is left for
> +		 * a follow-up.

Oh well...

> +		 */
> +		if (irqd_affinity_is_managed(&desc->irq_data)) {

So you set the affinity even on an interrupt which is shutdown?

> +			const struct cpumask *mask;
> +			struct cpumask *tmp = this_cpu_ptr(&__tmp_mask);
> +
> +			raw_spin_lock_irq(&desc->lock);

                        guard()

> +			mask = irq_data_get_affinity_mask(&desc->irq_data);
> +			cpumask_and(tmp, mask, hk_mask);
> +			if (cpumask_intersects(tmp, cpu_online_mask))
> +				irq_do_set_affinity(&desc->irq_data, tmp, false);

That's completely broken. You _cannot_ change the affinity mask of a
managed interrupt. The mask itself is immutable.

The effective affinity can be changed by invoking the affinity setter
with the original unmodified mask. irq_do_set_affinity() already deals
with the housekeeping mask.

Also invoking irq_do_set_affinity() directly here is just wrong. It
breaks interrupts which cannot be moved in process context.

But even if that is fixed, then there is zero coordination with the
affected drivers/subsystems. Managed interrupts are related to device
and block queues and you cannot change one without the other. Neither
can you stop managed interrupts without quiescing the related device
queue. Starting them up also requires reenabling the device queue.

This problem needs to be fixed no matter what. See below.

> +static int irq_hk_validate(enum hk_type type,
> +			   const struct cpumask *cur_mask,
> +			   const struct cpumask *new_mask)
> +{
> +	if (!IS_ENABLED(CONFIG_SMP))
> +		return -EOPNOTSUPP;
> +	return 0;

Seriously? Why is this stuff even built when CONFIG_SMP=n?

So these validate callbacks seem to be just another voodoo container for
no value.

While this series might work for you by some definition of "works", it's
broken beyond repair and it's really annoying that I explained all of it
to the other people who try to solve that very same problem. Of course
you did not read any of that otherwise you would have CC'ed them.

     https://lore.kernel.org/lkml/87o6jcb84w.ffs@tglx

Trying to do that without taking the CPUs mostly offline and bringing
them online again is not going to work and there is zero benefit trying
to avoid that. First of all changing the isolation is not a hotpath
operation. Doing it one by one without bringing the CPU completely down
as I outlined in the above linked mail is not much more disruptive than
trying to do all of this on the fly. If you isolate a CPU then the tasks
on that CPU which do not belong to the isolation set need to get off the
CPU anyway. If you unisolate a CPU then it's really not a problem
whether the non-isolated tasks can move on it 10 milliseconds earlier or
later.

If you want to solve all the problems related to NOHZ, managed
interrupts, RCU etc. without the hotplug machinery then you end up
replicating half of it. Don't even try to think about it, that's a
complete waste of time and won't go anywhere.

Fix the few issues which are related to hotplug that I described in the
above linked mail and use the fully correct and tested common code for
your isolation muck. Please coordinate with Waiman or whoever is working
on it at RH right now.

Thanks,

        tglx


* Re: [PATCH v3 06/13] tick/nohz, context_tracking: Prepare for runtime nohz_full updates
From: Thomas Gleixner @ 2026-06-18 19:49 UTC (permalink / raw)
  To: Jing Wu, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
	Shuah Khan
  Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
	Qiliang Yuan
In-Reply-To: <87ik7fep2j.ffs@fw13>

On Thu, Jun 18 2026 at 19:27, Thomas Gleixner wrote:
> On Thu, Jun 18 2026 at 11:11, Jing Wu wrote:
>> Remove __init from ct_cpu_track_user() and __initdata from the
>> initialized flag so context tracking can be activated on CPUs that
>> join nohz_full at runtime.  Drop the __ro_after_init attribute from
>> the context_tracking_key static key, allowing static_branch_dec()
>> when a CPU leaves nohz_full.
>>
>> Add ct_cpu_untrack_user() to reverse ct_cpu_track_user(), decrementing
>> the static key and clearing the per-CPU tracking state.
>
> Please do not enumerate WHAT the patch is doing. Explain the context and
> the WHY
>
>   https://docs.kernel.org/process/maintainer-tip.html#changelog

Just for the record. I told your colleague the same thing already....


* Re: [PATCH v3 06/13] tick/nohz, context_tracking: Prepare for runtime nohz_full updates
From: Thomas Gleixner @ 2026-06-18 17:27 UTC (permalink / raw)
  To: Jing Wu, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
	Shuah Khan
  Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
	Qiliang Yuan
In-Reply-To: <20260618-wujing-dhm-v3-6-28f1a4d83b68@gmail.com>

On Thu, Jun 18 2026 at 11:11, Jing Wu wrote:
> Remove __init from ct_cpu_track_user() and __initdata from the
> initialized flag so context tracking can be activated on CPUs that
> join nohz_full at runtime.  Drop the __ro_after_init attribute from
> the context_tracking_key static key, allowing static_branch_dec()
> when a CPU leaves nohz_full.
>
> Add ct_cpu_untrack_user() to reverse ct_cpu_track_user(), decrementing
> the static key and clearing the per-CPU tracking state.

Please do not enumerate WHAT the patch is doing. Explain the context and
the WHY

  https://docs.kernel.org/process/maintainer-tip.html#changelog


>  
>  #include <asm/irq_regs.h>
> @@ -653,11 +654,6 @@ void __init tick_nohz_init(void)
>  	if (!tick_nohz_full_running)
>  		return;
>  
> -	/*
> -	 * Full dynticks uses IRQ work to drive the tick rescheduling on safe
> -	 * locking contexts. But then we need IRQ work to raise its own
> -	 * interrupts to avoid circular dependency on the tick.
> -	 */

This comment is removed because it's no longer correct? How is this
related to $Subject?

>  	if (!arch_irq_work_has_interrupt()) {
>  		pr_warn("NO_HZ: Can't run full dynticks because arch doesn't support IRQ work self-IPIs\n");
>  		cpumask_clear(tick_nohz_full_mask);
> @@ -676,6 +672,16 @@ void __init tick_nohz_init(void)
>  		}
>  	}
>  
> +	/*
> +	 * Pre-initialize context tracking for all possible CPUs so
> +	 * ctx tracking is already active when a CPU is later added to
> +	 * nohz_full at runtime.  The tracking overhead is negligible
> +	 * because the static key is not incremented yet — only per-CPU
> +	 * tracking state is set up.
> +	 */
> +	if (IS_ENABLED(CONFIG_CONTEXT_TRACKING_USER_FORCE))
> +		context_tracking_init();

Seriously? Care to look where and when context_tracking_init() is invoked?

>  	for_each_cpu(cpu, tick_nohz_full_mask)
>  		ct_cpu_track_user(cpu);
>  
> @@ -686,6 +692,147 @@ void __init tick_nohz_init(void)
>  	pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
>  		cpumask_pr_args(tick_nohz_full_mask));
>  }
> +
> +static int tick_nohz_hk_validate(enum hk_type type,
> +				 const struct cpumask *cur_mask,
> +				 const struct cpumask *new_mask)
> +{
> +	if (!IS_ENABLED(CONFIG_NO_HZ_FULL))
> +		return -EOPNOTSUPP;
> +	return 0;
> +}

Why is this code even compiled when CONFIG_NO_HZ_FULL is not enabled?

> +
> +static void tick_nohz_hk_apply(enum hk_type type)
> +{
> +	static DEFINE_SPINLOCK(tick_nohz_lock);
> +	cpumask_var_t nohz_full, added, removed;
> +	bool was_running;
> +	int cpu;
> +
> +	if (!alloc_cpumask_var(&nohz_full, GFP_KERNEL))
> +		return;

This looks more than wrong. If this fails then the core code will
happily proceed with the completely wrong state.

> +	if (!alloc_cpumask_var(&added, GFP_KERNEL)) {
> +		free_cpumask_var(nohz_full);
> +		return;
> +	}
> +	if (!alloc_cpumask_var(&removed, GFP_KERNEL)) {
> +		free_cpumask_var(added);
> +		free_cpumask_var(nohz_full);
> +		return;
> +	}

        cpumask_var_t __free(free_cpumask_var) a = CPUMASK_VAR_NULL;
        cpumask_var_t __free(free_cpumask_var) b = CPUMASK_VAR_NULL;
        cpumask_var_t __free(free_cpumask_var) c = CPUMASK_VAR_NULL;

        if (!alloc_cpumask_var(&a, GFP_KERNEL))
        	return -ENOMEM;
        ....
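
A minimal userspace sketch of the scope-based cleanup pattern suggested
above, assuming GCC or Clang. The kernel's __free() helper builds on the
compiler's cleanup attribute; auto_free_mask, alloc_mask(), apply_masks()
and the fail_at knob are hypothetical stand-ins for illustration, not
kernel API:

```c
#include <errno.h>
#include <stdlib.h>

/* Counts successful frees so the behaviour can be observed from outside. */
static int nr_freed;

/* Cleanup handler: receives a pointer to the variable going out of scope. */
static void free_mask(unsigned long **p)
{
	if (*p) {
		free(*p);
		nr_freed++;
	}
}

/* Hypothetical stand-in for the kernel's __free(free_cpumask_var). */
#define auto_free_mask __attribute__((cleanup(free_mask)))

/* Returns nonzero on success; 'fail' simulates an allocation failure. */
static int alloc_mask(unsigned long **p, int fail)
{
	*p = fail ? NULL : calloc(1, sizeof(**p));
	return *p != NULL;
}

/* fail_at: index (0..2) of the allocation that should fail, or -1. */
static int apply_masks(int fail_at)
{
	unsigned long *a auto_free_mask = NULL;
	unsigned long *b auto_free_mask = NULL;
	unsigned long *c auto_free_mask = NULL;

	if (!alloc_mask(&a, fail_at == 0))
		return -ENOMEM;
	if (!alloc_mask(&b, fail_at == 1))
		return -ENOMEM;
	if (!alloc_mask(&c, fail_at == 2))
		return -ENOMEM;

	*c = *a | *b;	/* placeholder for the real mask arithmetic */
	return 0;	/* a, b and c are released on every return path */
}
```

Every early return releases whatever was allocated so far, so the manual
unwind ladder disappears and the error can be propagated to the caller.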

> +
> +	/*
> +	 * Snapshot the new HK_TYPE_KERNEL_NOISE mask under an RCU read lock.
> +	 * housekeeping_update_types() completes synchronize_rcu() before
> +	 * invoking apply(), so the new pointer is stable; however the lockdep
> +	 * annotation in housekeeping_cpumask() still requires an RCU read-side
> +	 * critical section for runtime-mutable types.

This comment is explaining the obvious: housekeeping_cpumask_rcu()

> +	 */
> +	rcu_read_lock();

        scoped_guard(rcu)


> +	cpumask_andnot(nohz_full, cpu_possible_mask,
> +		       housekeeping_cpumask_rcu(HK_TYPE_KERNEL_NOISE));
> +	rcu_read_unlock();
> +
> +	/*
> +	 * When "nohz_full=" was not passed at boot, tick_nohz_full_running is
> +	 * false and the full dynticks infrastructure (sched_tick_offload_init,
> +	 * RCU nohz quiescent-state reporting, context-tracking bootstrap) was
> +	 * never initialised.  In that case restrict the update to
> +	 * tick_nohz_full_mask so the /sys/devices/system/cpu/nohz_full sysfs
> +	 * attribute reflects DHM-isolated CPUs without enabling tick
> +	 * suppression, context tracking, or timer migration – all of which
> +	 * require boot-time setup and would deadlock on the first
> +	 * synchronize_rcu() call after CPUs are offlined.

What? You tell user space that the CPUs are nohz_full by updating the
mask, which is exposed in sysfs, which is blatantly wrong.

> +	 */
> +	was_running = READ_ONCE(tick_nohz_full_running);

Q: This READ_ONCE() pairs with which WRITE_ONCE()? 
A: With none, so it's just voodoo programming.

> +	spin_lock(&tick_nohz_lock);

This lock protects against the housekeeping core code invoking the apply
callback multiple times in parallel, right?

If that happens then there are bigger problems than corrupted masks.

> +	/*
> +	 * When nohz_full= was active at boot, compute the delta and update
> +	 * context tracking for CPUs joining or leaving the nohz_full set.
> +	 * Skip when !was_running: ct_cpu_track_user() calls
> +	 * static_branch_inc() which may sleep (jump_label_update on the
> +	 * 0→1 transition) – illegal inside a spinlock.

If you remove the pointless voodoo lock then this nonsense goes away too.

> +	 */
> +	if (IS_ENABLED(CONFIG_CONTEXT_TRACKING_USER) &&
> +	    was_running &&
> +	    cpumask_available(tick_nohz_full_mask)) {

Why is this stuff even invoked when the mask is not available? If it's
not there then NOHZ full is not functional, period.

> +		cpumask_andnot(added, nohz_full, tick_nohz_full_mask);
> +		cpumask_andnot(removed, tick_nohz_full_mask, nohz_full);
> +		for_each_cpu(cpu, added)
> +			ct_cpu_track_user(cpu);
> +		for_each_cpu(cpu, removed)
> +			ct_cpu_untrack_user(cpu);
> +	}
> +
> +	/*
> +	 * Update tick_nohz_full_mask unconditionally: this is the snapshot
> +	 * read by the /sys/devices/system/cpu/nohz_full sysfs attribute and
> +	 * must reflect the current isolation set even in the DHM runtime case.
> +	 */
> +	if (cpumask_available(tick_nohz_full_mask))
> +		cpumask_copy(tick_nohz_full_mask, nohz_full);

Seriously?

> +	/*
> +	 * Only modify tick_nohz_full_running and migrate the global tick when
> +	 * nohz_full= was set at boot; without boot-time setup, setting
> +	 * tick_nohz_full_running would suppress ticks on isolated CPUs and
> +	 * prevent RCU quiescent-state reporting, causing synchronize_rcu()
> +	 * to stall permanently when a CPU is subsequently offlined.
> +	 */
> +	if (was_running) {

Again, why is any of this invoked when NOHZ full was never enabled and
initialized?

> +		tick_nohz_full_running = !cpumask_empty(nohz_full);

Brilliant. When NOHZ full was enabled on the command line, then changing
the mask can disable "running" and that makes it disabled forever. There
is no way to reenable it.

This 'was_running' check is just wrong. What you need is a
'tick_nohz_full_initialized' boolean, which is only true when nohz_full
was set up early on including the mask.

If that's not the case, then none of this code is supposed to run
ever. I.e. the callback is not installed in the first place.
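
A rough userspace model of that gating, with one bit per CPU. This is only
a sketch under stated assumptions: housekeeping_update(),
tick_nohz_boot_init() and the hk_apply pointer are hypothetical names, and
the real code would operate on cpumasks rather than an unsigned long:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*hk_apply_fn)(unsigned long nohz_full);

static bool tick_nohz_full_initialized;	/* set once at boot, never cleared */
static bool tick_nohz_full_running;	/* may toggle at runtime */
static unsigned long tick_nohz_full_mask;
static hk_apply_fn hk_apply;		/* stays NULL without boot-time setup */

/* Runtime update: only reachable when boot-time setup has happened. */
static int tick_nohz_hk_apply(unsigned long nohz_full)
{
	tick_nohz_full_mask = nohz_full;
	tick_nohz_full_running = nohz_full != 0;
	return 0;
}

/* Boot path: installs the callback only if nohz_full= was given. */
static void tick_nohz_boot_init(bool on_cmdline, unsigned long nohz_full)
{
	if (!on_cmdline)
		return;

	tick_nohz_full_mask = nohz_full;
	tick_nohz_full_running = nohz_full != 0;
	tick_nohz_full_initialized = true;
	hk_apply = tick_nohz_hk_apply;
}

/* Model of the core code invoking the (possibly absent) callback. */
static int housekeeping_update(unsigned long nohz_full)
{
	if (!tick_nohz_full_initialized || !hk_apply)
		return -EOPNOTSUPP;
	return hk_apply(nohz_full);
}
```

Because the gate is the one-shot 'initialized' flag rather than the current
'running' state, emptying the mask and later refilling it re-enables
'running', and a system booted without nohz_full= never touches the mask.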

> +	/*
> +	 * Ensure tick_nohz_full_mask is allocated so that tick_nohz_hk_apply()
> +	 * can update it (and the /sys/devices/system/cpu/nohz_full sysfs
> +	 * attribute) when CPUs are isolated at runtime via DHM.  If "nohz_full="
> +	 * was passed at boot the mask is already allocated; allocate an empty
> +	 * one here for the runtime-only case.

What's the runtime only case? The fake exposure in sysfs which is just
misleading the user? Not going to happen. If it's not enabled on the
command line then it's disabled, end of story.

> +	 */
> +	if (!cpumask_available(tick_nohz_full_mask) &&
> +	    !zalloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL))
> +		pr_warn("tick/nohz: failed to allocate nohz_full_mask for DHM\n");

ROTFL. If the allocation fails, then the apply callback becomes a
complete noop doing magic cpumask operations for nothing and pretending
to be successful.

Thanks,

        tglx


* Re: [PATCH v3 05/13] cpu/hotplug: Reserve CPUHP states for nohz_full and managed IRQ down-paths
From: Thomas Gleixner @ 2026-06-18 16:06 UTC (permalink / raw)
  To: Jing Wu, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
	Shuah Khan
  Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
	Qiliang Yuan
In-Reply-To: <20260618-wujing-dhm-v3-5-28f1a4d83b68@gmail.com>

On Thu, Jun 18 2026 at 11:11, Jing Wu wrote:
> Add CPUHP_AP_NO_HZ_FULL_DYING and CPUHP_AP_IRQ_AFFINITY_DYING to the
> cpuhp_state enum.  These dying callbacks are invoked during CPU offline
> before the tick is stopped, enabling clean tick handover and managed
> IRQ migration when a CPU transitions between isolated and housekeeping
> states.
>
> The existing CPUHP_AP_IRQ_AFFINITY_ONLINE already handles managed IRQ
> restoration on CPU online.  The new dying callback completes the pair,
> migrating managed interrupts away from the CPU before it goes down.

What? They are migrated away today already when the CPU goes down unless
the CPU is the last one in the affinity set of the interrupt. So why do
you need a new step for something which already exists?

> Subsequent patches register handlers for these states.
>
> Signed-off-by: Jing Wu <realwujing@gmail.com>
> Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>

This SOB chain is broken (in all patches). See Documentation/process/...

Thanks,

        tglx


* Re: [PATCH v8 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
From: Nhat Pham @ 2026-06-18 12:36 UTC (permalink / raw)
  To: YoungJun Park
  Cc: akpm, chrisl, linux-mm, cgroups, linux-kernel, kasong, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, shikemeng,
	baoquan.he, baohua, yosry, gunho.lee, taejoon.song, hyungjun.cho,
	mkoutny, baver.bae, matia.kim
In-Reply-To: <ajNOSesjwTyZc8EX@yjaykim-PowerEdge-T330>

On Wed, Jun 17, 2026 at 9:47 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Wed, Jun 17, 2026 at 01:50:49PM -0400, Nhat Pham wrote:
>
> > On Wed, Jun 17, 2026 at 1:34 AM Youngjun Park <youngjun.park@lge.com> wrote:
> > >
> > > This is the v8 series of the swap tier patchset.
> > >
> > > Great thanks to Shakeel Butt and Yosry for the reviews and discussions [1].
> > > The main change in this version is the interface change to use
> > > memory.swap.tiers.max with '0' (disable) and 'max' (enable) values.
> > > This mechanism was suggested by Shakeel and Yosry
> >
> > I like this interface too :)
>
> > I think Yosry wants zswap as a tier, right?
> >
> > Just that without vswap, maybe don't allow it to be a tier of itself?
>
> With the current architecture, users cannot dynamically specify zswap as
> a tier, and zswap is a separate layer, so it is not tiered by itself.
>
> Once your vswap work lands, I think we can make zswap
> the default, top-level tier.
>
> After that, we can also look into cleaning up the zswap.writeback
> interface together.

SGTM if Yosry is happy with it :) FWIW, zswap is a conceptual tier,
whether we want to express it with your interface or not. This is just
interface clean-up work.

>
> > #2: Inter-tier promotion and demotion:
> >   Promotion and demotion apply between tiers, not within a single
> >   tier. The current interface defines only tier assignment; it does
> >   not yet define when or how pages move between tiers. Two triggering
> >   models are possible:
> >
> > >   (a) User-triggered: userspace explicitly initiates migration between
> > >       tiers (e.g. via a new interface or existing move_pages semantics).
> > >   (b) Kernel-triggered: the kernel moves pages between tiers at
> > >       appropriate points such as reclaim or refault.
> >
> > We'll likely need some kernel-triggered mechanism, or we'd have LRU inversion :)
> >
> > Cold pages will fill up fast tiers first, and more recent/warm pages
> > will land on slow tiers...
>
> Yeah, good point!
>
> > We'll also need to enforce isolation/fairness to make sure no workload
> > hoards the fast tiers too (but that probably requires demotion
> > support).
>
> Right, that makes sense.
>
> BTW, One thing I am curious about, though, is whether there are strong
> real-world use cases that require demotion/promotion.
> Theoretically, this looks useful but it would be helpful to better understand
> the requirements from such deployments.

I think so, yeah. The LRU inversion problem above is one :) Hard to
make proper tiering without demotion.

Say I have a workload that has an SLO - for example a PSI target - but
doesn't particularly care about exact memory placement. To optimize
resource usage, we want to place the warmer stuff in the fast tier, and the
coldest stuff in the slow tier, etc. Having the ability to do demotion
derisks the initial placement - we can place things in the fast tier
initially (and rather aggressively), then as pages age and prove their
coldness, we can move them to slower and slower tiers, etc.
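
To make the aging idea concrete, here is a toy userspace sketch, not a
proposal for the actual mechanism. struct swap_entry_model, touch(),
age_and_demote() and the epoch/threshold scheme are all assumptions made up
for illustration; nothing like this exists in the series:

```c
#include <stddef.h>

/* Toy model of one swapped-out page. */
struct swap_entry_model {
	unsigned int tier;	/* 0 is the fastest tier */
	unsigned int idle;	/* aging epochs since the last access */
};

/* An access resets the idle counter (promotion is not modelled here). */
static void touch(struct swap_entry_model *e)
{
	e->idle = 0;
}

/*
 * One aging epoch: every entry gets one epoch older, and entries that
 * stayed idle for 'threshold' epochs move one tier down, unless they
 * already sit in the slowest tier. Returns the number of demotions.
 */
static size_t age_and_demote(struct swap_entry_model *e, size_t n,
			     unsigned int nr_tiers, unsigned int threshold)
{
	size_t demoted = 0;

	for (size_t i = 0; i < n; i++) {
		e[i].idle++;
		if (e[i].idle >= threshold && e[i].tier + 1 < nr_tiers) {
			e[i].tier++;
			e[i].idle = 0;	/* must prove coldness again */
			demoted++;
		}
	}
	return demoted;
}
```

Everything starts in tier 0, which is the aggressive initial placement; only
entries that keep proving cold drift towards the slow tiers.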

Otherwise, what we end up with is really a placement preference
interface more than true tiering. Which is still useful especially
when co-tenant workloads have strict latency requirements, but perhaps
we don't need a full hierarchy-style interface for it? :)

The other use case is for fairness enforcement. We can (and probably
should) start with strict limits, but setting memory.swap.tiers.max for
each cgroup is a bit of a drag, and it might leave stranded capacity
in cgroups that are allocated fast swap tier capacity but do not
utilize it. If demotion is possible, we can let workloads use more than
what is fair, but then demote swap pages from that tier to enforce
fairness when necessary...

Obviously, it's a moot point if there is no good mechanism to transfer
data from one tier to another. The data might also be so cold that all of
this has diminishing returns, and moving things around costs more than
it's worth :) So I'm happy to start with something simple, then we can
figure out the next steps.

>
> > >
> > > #3: Per-VMA, per-process swap and BPF:
> > >   Not just for memcg based swap, possible to extend Per-VMA or per-process
> > >   swap. Or we can use it as BPF program.
> > >
> > > #4: Zswap and vswap tiering:
> > >   Tiering applies to the vswap + zswap combination.
> > >
> > > #5: Vswap on/off control:
> > >   Currently not supported. If a strong use case arises where vswap needs
> > >   to be controlled by memcg, the tier interface could be used for it.
> >
> > +1.
> >
> > Also, per-si/per-tier per-CPU allocation caching? :) Kairui already
> > has a patch for it, IIUC, but if not it's pretty critical I'd say.
>
> Yes, I missed it. Thank you for addressing it.
> we need an implementation that integrates this with the per-CPU
> allocation currently implemented on the vswap side.
>
> If Kairui's patch lands, my patch #4 also can be optimized based on that.

Yup!!

>
> > BTW, can we add some selftests, to make sure the new interface works
> > as expected, and to have example programs for new users to model their
> > scripts after? :)
>
> Yes, I agree. I think selftests are necessary.
>
> Do you want them to be introduced in this patchset, or would it be okay
> to add them separately as follow-up work?

If you have to send another version, might as well include them :)

Otherwise a follow-up is good. Thanks in advance for keeping our
codebase tested!

I'll take a look at the exact implementation on the swap side later,
but I suspect nothing much will have changed :)


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-18 11:13 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: David Hildenbrand (Arm), Balbir Singh, lsf-pc, linux-kernel,
	linux-cxl, cgroups, linux-mm, linux-trace-kernel, damon,
	kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
	dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
	dan.j.williams, longman, akpm, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple,
	axelrasmussen, yuanchu, weixugc, yury.norov, linux, mhiramat,
	mathieu.desnoyers, tj, hannes, mkoutny, jackmanb, sj, baolin.wang,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, muchun.song,
	xu.xin16, chengming.zhou, jannh, linmiaohe, nao.horiguchi,
	pfalcato, rientjes, shakeel.butt, riel, harry.yoo, cl,
	roman.gushchin, chrisl, kasong, shikemeng, nphamcs, bhe,
	zhengqi.arch, terry.bowman, Matthew Wilcox
In-Reply-To: <90418cd3-751f-439d-83ed-a0c33517c3bd@kernel.org>

On Thu, Jun 18, 2026 at 10:21:30AM +0200, Vlastimil Babka (SUSE) wrote:
> On 6/15/26 17:37, Gregory Price wrote:
> > 
> > One thought would be a way to switch what fallback list is used, and
> > then have specific fallback lists for certain contexts.
> > 
> > Right now there is a single example of this: __GFP_THISNODE
> >   |= __GFP_THISNODE   =>  NOFALLBACK
> >   &= ~__GFP_THISNODE  =>  FALLBACK
> > 
> > We could add an interface with the desired fallback list passed as an
> > argument, and let get_page_from_freelist() prefer that over the default
> > global lists.
> 
> Does it mean a new argument in a number of functions in the page allocator,
> or can it be mapped to alloc_flags (at least internally?), because the
> number of possible fallback lists is small enough?
>

What I ended up with was adding a single page_alloc.c external interface
that allows you to define the zonelist via an enum, and then an internal
selector resolution in prepare_alloc_pages() stored in alloc_context.

eg:

static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
                int preferred_nid, nodemask_t *nodemask,
                struct alloc_context *ac, gfp_t *alloc_gfp,
                unsigned int *alloc_flags)
{       
        ac->highest_zoneidx = gfp_zone(gfp_mask);
        ac->zonelist = select_zonelist(preferred_nid, gfp_mask, ac->zlsel);
	... snip ...
}

struct folio *__folio_alloc_zonelist_noprof(gfp_t gfp, unsigned int order,
                int preferred_nid, nodemask_t *nodemask,
                enum alloc_zonelist zlsel);


The original __folio_alloc* functions just add a DEFAULT - which tells
select_zonelist() to base the decision on __GFP_THISNODE.


struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
                nodemask_t *nodemask)
{
        return __folio_alloc_core(gfp, order, preferred_nid, nodemask,
                                  ALLOC_ZONELIST_DEFAULT);
}
EXPORT_SYMBOL(__folio_alloc_noprof);


This does a few things:
  - The isolation is structural, there is no way to accidentally
    allocate private memory without passing ALLOC_ZONELIST_PRIVATE

  - The isolation forces folios - there are no non-folio interfaces
    which allow zonelist selection

  - The zonelist selection is confined to this allocation context,
    so no inheritance is possible.
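
The selector resolution described above could be modelled roughly as
follows. This is a self-contained sketch, not the code from the series:
model_select_zonelist(), the MODEL_* names and the rule that the private
selector wins regardless of the THISNODE bit are assumptions; only
ALLOC_ZONELIST_DEFAULT and ALLOC_ZONELIST_PRIVATE appear in this mail:

```c
/* Selector passed in by the caller, as named in this mail. */
enum alloc_zonelist {
	ALLOC_ZONELIST_DEFAULT,
	ALLOC_ZONELIST_PRIVATE,
};

/* Hypothetical per-node zonelist indices for the model. */
enum model_zonelist {
	MODEL_ZONELIST_FALLBACK,
	MODEL_ZONELIST_NOFALLBACK,
	MODEL_ZONELIST_PRIVATE,
};

/* Stand-in for __GFP_THISNODE; the real bit value differs. */
#define MODEL_GFP_THISNODE 0x1u

/*
 * DEFAULT keeps today's behaviour: the THISNODE bit alone picks between
 * the fallback and no-fallback lists. The private list is reachable only
 * through an explicit selector, never through a gfp bit.
 */
static enum model_zonelist
model_select_zonelist(unsigned int gfp, enum alloc_zonelist zlsel)
{
	if (zlsel == ALLOC_ZONELIST_PRIVATE)
		return MODEL_ZONELIST_PRIVATE;

	return (gfp & MODEL_GFP_THISNODE) ? MODEL_ZONELIST_NOFALLBACK
					  : MODEL_ZONELIST_FALLBACK;
}
```

Since no combination of gfp bits maps to the private list in this model, a
caller that never passes the private selector cannot land on it by accident.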



I tried to avoid using an ALLOC_ flag so we can avoid yet another flag
crunch, but there certainly are few enough zonelists that we could
encode it there and expose it.  I know Brendan was looking at plumbing
alloc flags out to an interface, so I'm open to that.

Externally the way I determine what zonelist to use is a lookup based on
reason - letting the node filter.  This is really only needed in a
couple spots:

mm/khugepaged.c:  enum alloc_zonelist zlsel = alloc_zonelist_for_node(node, NODE_ALLOC_RECLAIM);
mm/vmscan.c:      mtc->zlsel = alloc_zonelist_for_nodemask(mtc->nmask, NODE_ALLOC_TIERING);
mm/migrate.c:     .zlsel = alloc_zonelist_for_node(node, NODE_ALLOC_USER_MIGRATE),

static inline enum alloc_zonelist
alloc_zonelist_for_node(int nid, enum node_alloc_reason reason)
{
        bool ok;

        if (!node_state(nid, N_MEMORY_PRIVATE))
                return ALLOC_ZONELIST_DEFAULT;
        switch (reason) {
        case NODE_ALLOC_RECLAIM:
                ok = node_is_reclaimable(nid);
                break;
        case NODE_ALLOC_TIERING:
                ok = node_allows_tiering(nid);
                break;
        case NODE_ALLOC_USER_MIGRATE:
                ok = node_allows_user_migrate(nid);
                break;
        default:
                ok = false;
        }
        return ok ? ALLOC_ZONELIST_PRIVATE : ALLOC_ZONELIST_DEFAULT;
}

Otherwise... everything is now a mempolicy w/ MPOL_F_BIND and all the
handling goes through the normal fault-paths :]

static struct page *__alloc_pages_mpol(gfp_t gfp, unsigned int order,
                struct mempolicy *pol, pgoff_t ilx, int nid)
{
        nodemask_t *nodemask;
        struct page *page;
        enum alloc_zonelist zlsel = (pol->flags & MPOL_F_PRIVATE) ?
                ALLOC_ZONELIST_PRIVATE : ALLOC_ZONELIST_DEFAULT;
...
        if (pol->mode == MPOL_PREFERRED_MANY)
                return alloc_pages_preferred_many(gfp, order, nid, nodemask,
                                                  zlsel);
...
}


Switching to an alloc_flag would probably be trivial if that's really
wanted.

~Gregory


* Re: [PATCH v2] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed
From: David Hildenbrand (Arm) @ 2026-06-18  8:41 UTC (permalink / raw)
  To: Waiman Long, Gregory Price
  Cc: Farhad Alemi, Andrew Morton, Farhad Alemi, Yury Norov,
	Joshua Hahn, Zi Yan, Matthew Brost, Rakie Kim, Byungchul Park,
	Ying Huang, Alistair Popple, Rasmus Villemoes, linux-mm,
	linux-kernel, cgroups, stable
In-Reply-To: <c61c7925-b9e7-4a6f-82e2-398849ad9f27@redhat.com>

On 6/16/26 17:23, Waiman Long wrote:
> On 6/16/26 2:59 AM, David Hildenbrand (Arm) wrote:
>> On 6/16/26 05:43, Waiman Long wrote:
>>> BTW, I still prefer the v2 patch. If it is decided we should use the
>>> guarantee_online_mems() value instead, it will have to be a separate patch with
>>> changes in the relevant documentation like Documentation/admin-guide/cgroup-v1/
>>> cpuset.rst.
>> newmems is "obviously" correct, so I really don't see why we should add
>> something that needs half a page of text to explain why it is fine -- if newmems
>> just does the trick?
>>
>> Please enlighten me.
> 
> Yes, taking newmems is a reasonable choice and there are pros and cons with each
> option. My focus is more on not changing how v1 cpuset behaves as it is well
> defined in the v1 cpusets.rst file:
> 
>     Requests by a task, using the sched_setaffinity(2) system call to
>     include CPUs in its CPU affinity mask, and using the mbind(2) and
>     set_mempolicy(2) system calls to include Memory Nodes in its memory
>     policy, are both filtered through that task's cpuset, filtering out any
>     CPUs or Memory Nodes not in that cpuset.  The scheduler will not
>     schedule a task on a CPU that is not allowed in its cpus_allowed
>     vector, and the kernel page allocator will not allocate a page on a
>     node that is not allowed in the requesting task's mems_allowed vector.
> 
> v2, OTOH, is more vague as to what setting cpuset.mems will mean and we
> generally follow what v1 is doing, but we have more leeway of what we can do.
> 
> Using newmems will make the above text not totally correct. At least the offline
> memory nodes will be filtered out which will not be utilized by the task when
> the offline node becomes online. That is why I am saying that we will have to
> correct the documentation if we want to make this change.

So IIUC:

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 1335e437098e..cdfc615f35a5 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2645,7 +2645,13 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
 
                migrate = is_memory_migrate(cs);
 
-               mpol_rebind_mm(mm, &cs->mems_allowed);
+               /*
+                * For v1 we can have empty effective_mems, but we cannot
+                * attach any tasks (see cpuset_can_attach_check()). For v2,
+                * it's guaranteed to not be empty.
+                */
+               VM_WARN_ON_ONCE(nodes_empty(cs->effective_mems));
+               mpol_rebind_mm(mm, &cs->effective_mems);
                if (migrate)
                        cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
                else


-- 
Cheers,

David


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: David Hildenbrand (Arm) @ 2026-06-18  8:31 UTC (permalink / raw)
  To: Gregory Price, Brendan Jackman
  Cc: Vlastimil Babka (SUSE), Balbir Singh, lsf-pc, linux-kernel,
	linux-cxl, cgroups, linux-mm, linux-trace-kernel, damon,
	kernel-team, gregkh, rafael, dakr, dave, jonathan.cameron,
	dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
	dan.j.williams, longman, akpm, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, osalvador, ziy, matthew.brost,
	joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple,
	axelrasmussen, yuanchu, weixugc, yury.norov, linux, mhiramat,
	mathieu.desnoyers, tj, hannes, mkoutny, jackmanb, sj, baolin.wang,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, muchun.song,
	xu.xin16, chengming.zhou, jannh, linmiaohe, nao.horiguchi,
	pfalcato, rientjes, shakeel.butt, riel, harry.yoo, cl,
	roman.gushchin, chrisl, kasong, shikemeng, nphamcs, bhe,
	zhengqi.arch, terry.bowman, Matthew Wilcox
In-Reply-To: <ajFT235iYsSJ7nbR@gourry-fedora-PF4VCD3F>

On 6/16/26 15:47, Gregory Price wrote:
> On Tue, Jun 16, 2026 at 11:57:42AM +0000, Brendan Jackman wrote:
>> On Mon Jun 15, 2026 at 2:38 PM UTC, Vlastimil Babka (SUSE) wrote:
>>>
>>> I think the memalloc approach is dangerous due to unexpected nesting. There
>>> might be nested page allocations in page allocation itself (due to some
>>> debugging option). But also interrupts do not change what "current" points
>>> to. Suddenly those could start requesting folios and/or private nodes and be
>>> surprised, I'm afraid.
>>
>> Minor side-note: couldn't we just define it such that the allocator
>> ignores the context when not in_task() (and warn if you try to enter the
>> context while not currently in_task())?
>>
>> (Don't think this would change the conclusion very much, e.g. doesn't
>> help with the nesting issues. Mostly curious in case I'm missing a
>> detail here).
>>

So I took a look at which nested allocations we could end up having, and I
wonder whether gfp_nested_mask() indicates all these?

If we could reliably identify them, all we'd have to do is save+restore some
context (activating a "nested" context).

> 
> I looked at this - only solves one issue and oh boy is that an obtuse
> confusing condition to understand.  We still suffer from recursion in
> reclaim.

Right, we'd have to clear the context before calling into reclaim/compaction
that does weird things.

I'm sure BPF hooks could just arbitrarily try to allocate pages with
kmalloc_nolock(). So that would require a context save/restore as well.
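
A toy userspace model of such a save/restore around a path that may
allocate on its own. All names (alloc_ctx, alloc_ctx_save(),
alloc_ctx_restore(), reclaim_model()) are hypothetical, and a thread-local
variable stands in for per-task state:

```c
/* Hypothetical allocation contexts. */
#define ALLOC_CTX_NONE		0u
#define ALLOC_CTX_PRIVATE	1u

/* Stand-in for a field in 'current'. */
static _Thread_local unsigned int alloc_ctx = ALLOC_CTX_NONE;

/* Enter a context and hand back the previous one for a later restore. */
static unsigned int alloc_ctx_save(unsigned int ctx)
{
	unsigned int old = alloc_ctx;

	alloc_ctx = ctx;
	return old;
}

static void alloc_ctx_restore(unsigned int old)
{
	alloc_ctx = old;
}

/* Stand-in for a nested allocation: reports the context it would see. */
static unsigned int nested_alloc_sees(void)
{
	return alloc_ctx;
}

/*
 * Model of calling into reclaim/compaction (or a BPF hook): the outer
 * context is cleared first, so nested allocations see a neutral context,
 * and it is put back before returning to the main allocation.
 */
static unsigned int reclaim_model(void)
{
	unsigned int old = alloc_ctx_save(ALLOC_CTX_NONE);
	unsigned int seen = nested_alloc_sees();

	alloc_ctx_restore(old);
	return seen;
}
```

The open question from above remains how to reliably identify every such
nested entry point; the sketch only shows what each one would have to do.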

-- 
Cheers,

David


* Re: [PATCH v3 14/15] mm/slab: remove __GFP_NO_OBJ_EXT usage from alloc_slab_obj_exts()
From: Hao Li @ 2026-06-18  8:23 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Harry Yoo, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <26c29e4b-09b1-424a-b4e4-3358aac20115@kernel.org>

On Wed, Jun 17, 2026 at 04:36:58PM +0200, Vlastimil Babka (SUSE) wrote:
> On 6/17/26 15:56, Harry Yoo wrote:
> > 
> > 
> > On 6/15/26 8:54 PM, Vlastimil Babka (SUSE) wrote:
> >> __GFP_NO_OBJ_EXT has limited scope within the slab allocator itself and
> >> gfp flags are a scarce resource, unlike slab's alloc_flags.
> >> 
> >> Introduce SLAB_ALLOC_NO_RECURSE alloc flag that has the same intent as
> >> __GFP_NO_OBJ_EXT but a more generic name, meaning that a kmalloc()
> >> family function should not recurse into another kmalloc*() for the
> >> purposes of allocating auxiliary structures (obj_ext arrays or sheaves).
> >> 
> >> First, replace the __GFP_NO_OBJ_EXT for allocating obj_ext arrays in
> >> alloc_slab_obj_exts(). Make use of the newly added kmalloc_flags()
> >> function, where we can pass alloc_flags with SLAB_ALLOC_NO_RECURSE
> >> added. This will also pass through SLAB_ALLOC_NOLOCK so we don't need
> >> to special case kmalloc_nolock() anymore.
> >> 
> >> Note that until now the kmalloc_nolock() ignored the incoming gfp flags
> >> and hardcoded __GFP_ZERO | __GFP_NO_OBJ_EXT. But it's correct to pass on
> >> the incoming gfp flags (only augmented with __GFP_ZERO), because if
> >> alloc_flags contain SLAB_ALLOC_NOLOCK, the incoming gfp flags have to
> >> be also compatible with it. However, we might have added __GFP_THISNODE
> >> for opportunistic slab allocation, as pointed out by Hao Li, and
> >> __GFP_COMP by allocate_slab() as pointed out by Shengming Hu. Solve this
> >> by adding both flags to OBJCGS_CLEAR_MASK as it makes sense to strip
> >> them anyway for non-kmalloc_nolock() allocations of sheaves or obj_ext
> >> arrays as well.
> >> 
> >> To avoid recursion of sheaf -> obj_ext -> sheaf -> ... allocations at
> >> this patch, until the next patch converts sheaves to
> >> SLAB_ALLOC_NO_RECURSE, use both gfp and alloc_flags for obj_ext. The
> >> next patch will remove the gfp part.
> >> 
> >> Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-15-7190909db118@kernel.org
> >> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> >> ---
> > 
> > Looks good to me,
> > Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
> 
> Thanks!
>  
> > With some comments below.
> > 
> > I was worried that perhaps replacing __GFP_NO_OBJ_EXT with
> > SLAB_ALLOC_NO_RECURSE will create a cycle of
> > 
> > alloc_slab_obj_exts(SLAB_ALLOC_DEFAULT)
> > -> kmalloc_flags(SLAB_ALLOC_NO_RECURSE)
> > -> alloc_from_pcs(SLAB_ALLOC_NO_RECURSE)
> > -> refill_objects(SLAB_ALLOC_DEFAULT)
> > -> new_slab(SLAB_ALLOC_DEFAULT)
> > -> account_slab(SLAB_ALLOC_DEFAULT)
> > -> alloc_slab_obj_exts(SLAB_ALLOC_DEFAULT)
> > 
> > with __GFP_NO_OBJ_EXT, it would have been passed to refill_objects(),
> > but SLAB_ALLOC_NO_RECURSE is not. However this cycle does not exist
> > because alloc_slab_obj_exts() clears __GFP_ACCOUNT (as part of
> > OBJCGS_CLEAR_MASK) and memory profiling itself does not invoke
> > alloc_slab_obj_exts() when allocating new slabs if SLAB_ACCOUNT is not
> > set (which is interesting, by the way).
> 
> Hm yeah I think we should propagate alloc_flags to refill_objects() etc, to 
> avoid later surprise. But can be done as a later cleanup.
>  
> > Also alloc_slab_obj_exts() propagating SLAB_ALLOC_NEW_SLAB to
> > kmalloc_flags() is a little bit confusing because it does not have any
> > effect due to SLAB_ALLOC_NO_RECURSE.
> 
> OK let's address this one by this fixup:

Both the patch and the fix look good to me.
Reviewed-by: Hao Li <hao.li@linux.dev>

-- 
Thanks,
Hao


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Vlastimil Babka (SUSE) @ 2026-06-18  8:21 UTC (permalink / raw)
  To: Gregory Price, David Hildenbrand (Arm)
  Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman, Matthew Wilcox
In-Reply-To: <ajAcIwBAnqgEEWSD@gourry-fedora-PF4VCD3F>

On 6/15/26 17:37, Gregory Price wrote:
> On Mon, Jun 15, 2026 at 05:18:55PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/15/26 16:38, Vlastimil Babka (SUSE) wrote:
>> > 
>> > I think the memalloc approach is dangerous due to unexpected nesting. There
>> > might be nested page allocations in page allocation itself (due to some
>> > debugging option). But also interrupts do not change what "current" points
>> > to. Suddenly those could start requesting folios and/or private nodes and be
>> > surprised, I'm afraid.
>> 
>> Yeah, we'd need some way to distinguish the main allocation from these other
>> (nested) allocations.
>>
>> 
>> > 
>> > The memalloc scopes only work well when they restrict the context wrt
>> > reclaim, and allocations in IRQ have to be already restricted heavily
>> > (atomic) so further memalloc restrictions don't do anything in practice. But
>> > to make them change other aspects of the allocations like this won't work.
>> 
>> I was assuming that memalloc_pin_save() would already violate that, but really
>> it only restricts where movable allocations land, and that doesn't matter for
>> other kernel allocations.
>> 
>> Do you see any other way to make something like an allocation context work, and
>> avoid introducing more GFP flags?
>>
> 
> One thought would be a way to switch what fallback list is used, and
> then have specific fallback lists for certain contexts.
> 
> Right now there is a single example of this: __GFP_THISNODE
>   |= __GFP_THISNODE   =>  NOFALLBACK
>   &= ~__GFP_THISNODE  =>  FALLBACK
> 
> We could add an interface with the desired fallback list based as an
> argument, and let get_page_from_freelist to prefer that over the default
> global lists.

Does it mean a new argument in a number of functions in the page allocator,
or can it be mapped to alloc_flags (at least internally?), because the
number of possible fallback lists is small enough?

> Omit all special nodes from FALLBACK/NOFALLBACK and make the special
> contexts provide the fallback-base that should be used.
> 
> On my current branch i think that would include modifying, in totality:
> 
>    alloc_folio_mpol()
>    alloc_demotion_folio()
>    alloc_migration_target()
> 
> And i'm pretty sure that all just nests nicely.
> 
> We might not even need memalloc... hmmm
> 
> ~Gregory


^ permalink raw reply

* Re: [PATCH v3 10/15] mm/slab: allow kmem_cache_alloc_bulk() with any gfp flags
From: Hao Li @ 2026-06-18  8:09 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Harry Yoo, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260615-slab_alloc_flags-v3-10-ce1146d140fb@kernel.org>

On Mon, Jun 15, 2026 at 01:54:43PM +0200, Vlastimil Babka (SUSE) wrote:
> The last user of gfpflags_allow_spinning() in slab is
> alloc_from_pcs_bulk(), which is only called from
> kmem_cache_alloc_bulk().
> 
> It turns out that gfpflags_allow_spinning() is not necessary, because
> kmem_cache_alloc_bulk() is only expected to be called from context that
> does allow spinning, so simply replace it with 'true'. This means we can
> also drop the gfp parameter from alloc_from_pcs_bulk().
> 
> With that, we can remove the "@flags must allow spinning" part of the
> kernel doc, as there is no more connection to the gfp flags in the slab
> implementation.
> 
> Also remove a comment in alloc_slab_obj_exts() because there should be
> no more false positives possible due to gfp_allowed_mask during early
> boot.
> 
> Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-11-7190909db118@kernel.org
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Looks good to me.
Reviewed-by: Hao Li <hao.li@linux.dev>

-- 
Thanks,
Hao

^ permalink raw reply

* Re: [PATCH v3 08/15] mm/slab: pass alloc_flags through slab_post_alloc_hook() chain
From: Hao Li @ 2026-06-18  8:07 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Harry Yoo, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260615-slab_alloc_flags-v3-8-ce1146d140fb@kernel.org>

On Mon, Jun 15, 2026 at 01:54:41PM +0200, Vlastimil Babka (SUSE) wrote:
> Convert the whole following call stack to pass either slab_alloc_context
> (thus including alloc_flags) or just alloc_flags as necessary:
> 
> slab_post_alloc_hook()
>   alloc_tagging_slab_alloc_hook()
>     __alloc_tagging_slab_alloc_hook()
>       prepare_slab_obj_exts_hook()
>         alloc_slab_obj_exts()
>   memcg_slab_post_alloc_hook()
>     __memcg_slab_post_alloc_hook()
>       alloc_slab_obj_exts()
> 
> Converting all these at once avoids unnecessary churn and is mostly
> mechanical.
> 
> This ultimately allows to decide if spinning is allowed using
> alloc_flags in alloc_slab_obj_exts(), as well as slab_post_alloc_hook().
> Aside from alloc_from_pcs_bulk() (to be handled next) there is nothing
> else in slab itself relying on gfpflags_allow_spinning() which can
> be false even if not called from kmalloc_nolock().
> 
> A followup change will also use the alloc_flags availability in the call
> stack above to remove the __GFP_NO_OBJ_EXT flag.
> 
> For alloc_slab_obj_exts(), also replace the suboptimal "bool new_slab"
> parameter with a SLAB_ALLOC_NEW_SLAB flag with identical functionality.
> 
> To further reduce the number of parameters of slab_post_alloc_hook(),
> also make 'struct list_lru *lru' (which is NULL for most callers) a new
> field of slab_alloc_context.
> 
> Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-9-7190909db118@kernel.org
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Looks good to me.
Reviewed-by: Hao Li <hao.li@linux.dev>

-- 
Thanks,
Hao

^ permalink raw reply

* [PATCH v3 13/13] selftests/cgroup: Add kernel-noise isolation test to cpuset selftest
From: Jing Wu @ 2026-06-18  3:11 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Thomas Gleixner
  Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
	Qiliang Yuan
In-Reply-To: <20260618-wujing-dhm-v3-0-28f1a4d83b68@gmail.com>

Add test_hk_noise_isolated() to test_cpuset_prs.sh to verify that
creating and destroying an isolated cpuset partition updates both the
domain isolation state and the kernel-noise (nohz_full) state.

For domain isolation, the test checks cpuset.cpus.isolated before and
after the partition create/destroy cycle.

For kernel-noise isolation, the test reads
/sys/devices/system/cpu/nohz_full to confirm that the CPUs placed in
an isolated partition appear in the nohz_full mask while the partition
is active, and are removed from it once the partition is destroyed.
This sysfs attribute only exists when CONFIG_NO_HZ_FULL is enabled;
the nohz_full checks are skipped when it is absent so the test remains
usable on kernels without NO_HZ_FULL.
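
The optional-check pattern described above can be sketched roughly as
follows.  This is an illustration only, not the code from the patch: the
helper name read_nohz_full and its optional path argument are hypothetical,
added so the sketch can be exercised against an ordinary file.

```shell
# Hypothetical helper: print the nohz_full cpulist if the attribute is
# readable; return non-zero otherwise so the caller can skip the checks.
read_nohz_full()
{
	# $1: path to the attribute; defaults to the sysfs file named above.
	nohz_path=${1:-/sys/devices/system/cpu/nohz_full}
	[ -r "$nohz_path" ] || return 1
	cat "$nohz_path"
}

# Usage: only record and compare masks when the attribute exists.
if nohz_before=$(read_nohz_full); then
	echo "nohz_full before: '$nohz_before'"
else
	echo "nohz_full checks skipped"
fi
```

Returning a status instead of setting a flag keeps the skip decision at the
call site, which is one possible alternative to the HK_NOHZ_CHECK variable
used in the test.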

Add cpu_in_cpulist() to correctly determine whether a CPU number falls
within a kernel cpulist string (e.g. "4-7").  A plain grep cannot
detect membership in the interior of a range; cpu_in_cpulist() walks
each comma-separated element and handles both single values and
lo-hi ranges explicitly.
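
To make the grep limitation concrete, here is a small standalone sketch.
It is a POSIX sh variant written for illustration, not the bash helper
added by this patch; the name cpu_in_list is hypothetical.

```shell
# Hypothetical POSIX sh variant of the range walk.  The function body runs
# in a subshell so its variables and the IFS change do not leak out.
cpu_in_list()
(
	cpu=$1
	IFS=,
	for item in $2; do
		case $item in
		*-*)
			# "lo-hi" range: test both bounds.
			lo=${item%-*}
			hi=${item#*-}
			[ "$cpu" -ge "$lo" ] && [ "$cpu" -le "$hi" ] && exit 0
			;;
		*)
			# Single CPU number.
			[ "$cpu" -eq "$item" ] && exit 0
			;;
		esac
	done
	exit 1
)

# A word match for "5" finds nothing in "4-7", while the walk does.
echo "4-7" | grep -qw 5 || echo "grep: 5 not found in 4-7"
cpu_in_list 5 "4-7" && echo "walk: 5 is in 4-7"
```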

The test also covers: rejection of all-CPU isolation, the SMT sibling
constraint, nested partition inheritance, and a 100-cycle pressure test.
nohz_full is verified to be restored to its pre-test value after each
create/destroy cycle and after the pressure test.

Fix awk invocation to drop the spurious -e flag.

Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
 tools/testing/selftests/cgroup/test_cpuset_prs.sh | 204 +++++++++++++++++++++-
 1 file changed, 203 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index a56f4153c64df..047db14953fac 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -20,7 +20,7 @@ skip_test() {
 WAIT_INOTIFY=$(cd $(dirname $0); pwd)/wait_inotify
 
 # Find cgroup v2 mount point
-CGROUP2=$(mount -t cgroup2 | head -1 | awk -e '{print $3}')
+CGROUP2=$(mount -t cgroup2 | head -1 | awk '{print $3}')
 [[ -n "$CGROUP2" ]] || skip_test "Cgroup v2 mount point not found!"
 SUBPARTS_CPUS=$CGROUP2/.__DEBUG__.cpuset.cpus.subpartitions
 CPULIST=$(cat $CGROUP2/cpuset.cpus.effective)
@@ -1204,9 +1204,211 @@ test_inotify()
 	echo "" > cpuset.cpus
 }
 
+#
+# cpu_in_cpulist <cpu> <cpulist>
+#
+# Return 0 if <cpu> appears in <cpulist> (a kernel cpumask list such as
+# "0-3,8-31"), non-zero otherwise.  The kernel cpulist format uses ranges
+# ("lo-hi") and comma-separated items; a simple grep cannot detect that a
+# number falls in the middle of a range, so walk each element explicitly.
+#
+cpu_in_cpulist()
+{
+	local cpu=$1 list=$2 range lo hi
+	for range in $(echo "$list" | tr ',' ' '); do
+		if [[ "$range" == *-* ]]; then
+			lo=${range%-*}
+			hi=${range#*-}
+			[[ $cpu -ge $lo && $cpu -le $hi ]] && return 0
+		else
+			[[ $cpu -eq $range ]] && return 0
+		fi
+	done
+	return 1
+}
+
+#
+# Test that isolated partition creation/destruction drives kernel-noise
+# housekeeping mask updates and remains correct under pressure.
+#
+# Requires: >=8 CPUs, no isolcpus= boot conflict, root
+#
+test_hk_noise_isolated()
+{
+	local ISOL_BEFORE TEST_CPUS i PART ISOL_AFTER ISOL_RESTORE
+	local NOHZ_FILE NOHZ_BEFORE NOHZ_AFTER NOHZ_RESTORE
+	local HK_NOHZ_CHECK=0
+	local LOOPS=100
+
+	[[ $NR_CPUS -ge 8 ]] || {
+		echo "HK-noise test skipped: need >=8 CPUs, have $NR_CPUS"
+		return 0
+	}
+
+	# Detect whether CONFIG_NO_HZ_FULL is active: the sysfs attribute
+	# /sys/devices/system/cpu/nohz_full exposes the current nohz_full
+	# cpumask and is only present when NO_HZ_FULL is enabled.
+	NOHZ_FILE=/sys/devices/system/cpu/nohz_full
+	[[ -r "$NOHZ_FILE" ]] && HK_NOHZ_CHECK=1
+
+	cd $CGROUP2/test
+	echo member > cpuset.cpus.partition 2>/dev/null
+	echo "" > cpuset.cpus 2>/dev/null
+
+	ISOL_BEFORE=$(cat $CGROUP2/cpuset.cpus.isolated)
+	[[ $HK_NOHZ_CHECK -eq 1 ]] && NOHZ_BEFORE=$(cat $NOHZ_FILE)
+	TEST_CPUS="4-7"
+	echo $TEST_CPUS > cpuset.cpus
+
+	#
+	# Basic create/destroy cycle — verify domain isolation and
+	# kernel-noise (nohz_full) changes together.
+	#
+	console_msg "HK-noise: basic create/destroy cycle"
+	echo isolated > cpuset.cpus.partition
+
+	ISOL_AFTER=$(cat $CGROUP2/cpuset.cpus.isolated)
+	[[ $ISOL_AFTER != "$ISOL_BEFORE" ]] || {
+		echo "FAIL: isolated set unchanged after partition create"
+		exit 1
+	}
+
+	if [[ $HK_NOHZ_CHECK -eq 1 ]]; then
+		NOHZ_AFTER=$(cat $NOHZ_FILE)
+		# Verify that the newly isolated CPUs (4-7) appear in nohz_full.
+		# nohz_full = inverse of housekeeping, so isolating 4-7 should
+		# add them to nohz_full.
+		for cpu in 4 5 6 7; do
+			if ! cpu_in_cpulist $cpu "$NOHZ_AFTER"; then
+				echo "FAIL: cpu${cpu} not in nohz_full after isolation" \
+				     "(got: '$NOHZ_AFTER')"
+				exit 1
+			fi
+		done
+		console_msg "HK-noise: nohz_full after isolation: $NOHZ_AFTER"
+	fi
+
+	echo member > cpuset.cpus.partition
+
+	ISOL_RESTORE=$(cat $CGROUP2/cpuset.cpus.isolated)
+	[[ $ISOL_RESTORE = "$ISOL_BEFORE" ]] || {
+		echo "FAIL: expected '$ISOL_BEFORE' after destroy, got '$ISOL_RESTORE'"
+		exit 1
+	}
+
+	if [[ $HK_NOHZ_CHECK -eq 1 ]]; then
+		NOHZ_RESTORE=$(cat $NOHZ_FILE)
+		[[ "$NOHZ_RESTORE" = "$NOHZ_BEFORE" ]] || {
+			echo "FAIL: nohz_full not restored: expected '$NOHZ_BEFORE'," \
+			     "got '$NOHZ_RESTORE'"
+			exit 1
+		}
+	fi
+
+	#
+	# Reject all-CPU isolation (must leave at least one housekeeping CPU)
+	#
+	console_msg "HK-noise: reject all-CPU isolation"
+	echo 0-$((NR_CPUS - 1)) > cpuset.cpus
+	echo isolated > cpuset.cpus.partition
+	PART=$(cat cpuset.cpus.partition)
+	[[ $PART = *invalid* || $PART = member ]] || {
+		echo "FAIL: all-CPU isolation was not rejected, got '$PART'"
+		exit 1
+	}
+
+	#
+	# SMT safety: partial sibling isolation
+	#
+	console_msg "HK-noise: SMT sibling constraint"
+	echo $TEST_CPUS > cpuset.cpus
+	echo isolated > cpuset.cpus.partition
+	PART=$(cat cpuset.cpus.partition)
+	[[ $PART = isolated ]] || {
+		echo "FAIL: could not create isolated partition, got '$PART'"
+		exit 1
+	}
+	echo member > cpuset.cpus.partition
+
+	#
+	# Nested partition: parent root → child isolated
+	#
+	console_msg "HK-noise: nested partition inheritance"
+	echo $TEST_CPUS > cpuset.cpus
+	test_partition root
+	mkdir -p HK_SUB
+	cd HK_SUB
+	echo 4-5 > cpuset.cpus
+	echo isolated > cpuset.cpus.partition
+	ISOL_AFTER=$(cat $CGROUP2/cpuset.cpus.isolated)
+	[[ -n $ISOL_AFTER ]] || {
+		echo "FAIL: nested isolated partition not reflected in cpuset.cpus.isolated"
+		exit 1
+	}
+	echo member > cpuset.cpus.partition
+	cd $CGROUP2/test
+	echo member > cpuset.cpus.partition
+	rmdir HK_SUB 2>/dev/null
+
+	#
+	# Pressure test: 100 create/destroy cycles
+	#
+	console_msg "HK-noise: pressure test ($LOOPS cycles)"
+	echo $TEST_CPUS > cpuset.cpus
+	for i in $(seq 1 $LOOPS); do
+		echo isolated > cpuset.cpus.partition
+		PART=$(cat cpuset.cpus.partition)
+		[[ $PART = isolated ]] || {
+			echo "FAIL: cycle $i create failed, got '$PART'"
+			exit 1
+		}
+		echo member > cpuset.cpus.partition
+		PART=$(cat cpuset.cpus.partition)
+		[[ $PART = member ]] || {
+			echo "FAIL: cycle $i destroy failed, got '$PART'"
+			exit 1
+		}
+	done
+
+	#
+	# Stability: after pressure test, verify final state
+	#
+	console_msg "HK-noise: post-pressure cleanup"
+	echo isolated > cpuset.cpus.partition
+	ISOL_AFTER=$(cat $CGROUP2/cpuset.cpus.isolated)
+	[[ -n $ISOL_AFTER ]] || {
+		echo "FAIL: isolated set empty after pressure test"
+		exit 1
+	}
+	echo member > cpuset.cpus.partition
+	echo "" > cpuset.cpus
+	ISOL_RESTORE=$(cat $CGROUP2/cpuset.cpus.isolated)
+	[[ $ISOL_RESTORE = "$ISOL_BEFORE" ]] || {
+		echo "FAIL: final isolated '$ISOL_RESTORE' != '$ISOL_BEFORE'"
+		exit 1
+	}
+
+	if [[ $HK_NOHZ_CHECK -eq 1 ]]; then
+		NOHZ_RESTORE=$(cat $NOHZ_FILE)
+		[[ "$NOHZ_RESTORE" = "$NOHZ_BEFORE" ]] || {
+			echo "FAIL: nohz_full not restored after pressure test:" \
+			     "expected '$NOHZ_BEFORE', got '$NOHZ_RESTORE'"
+			exit 1
+		}
+	fi
+
+	cd $CGROUP2
+	if [[ $HK_NOHZ_CHECK -eq 1 ]]; then
+		console_msg "HK-noise: PASSED (with nohz_full verification)"
+	else
+		console_msg "HK-noise: PASSED (nohz_full skipped: CONFIG_NO_HZ_FULL not active)"
+	fi
+}
+
 trap cleanup 0 2 3 6
 run_state_test TEST_MATRIX
 run_remote_state_test REMOTE_TEST_MATRIX
 test_isolated
 test_inotify
+test_hk_noise_isolated
 echo "All tests PASSED."

-- 
2.43.0


^ permalink raw reply related

* [PATCH v3 12/13] docs: cgroup-v2: Document kernel-noise isolation via isolated partitions
From: Jing Wu @ 2026-06-18  3:11 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Thomas Gleixner
  Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
	Qiliang Yuan
In-Reply-To: <20260618-wujing-dhm-v3-0-28f1a4d83b68@gmail.com>

Document that cpuset.cpus.partition=isolated now drives runtime updates
of the housekeeping masks for kernel-noise types: nohz_full (tick
suppression), RCU NOCB offloading, and managed IRQ migration.  No
additional cgroupfs files are required; the partition update path
automatically triggers explicit housekeeping callbacks for all affected
subsystems.
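
As a rough illustration of the user-visible workflow, the sketch below
writes the two existing cpuset control files involved.  The helper names
and the cgroup directory argument are hypothetical; on a real system the
directory would be a cgroup v2 child with the cpuset controller enabled,
and the writes need root.

```shell
# Hypothetical helpers; $1 is a cgroup v2 directory, $2 a cpulist.
make_isolated_partition()
{
	# Assign the CPUs first, then request the isolated partition type.
	echo "$2" > "$1/cpuset.cpus" &&
	echo isolated > "$1/cpuset.cpus.partition"
}

drop_isolated_partition()
{
	# Turning the cpuset back into a plain member destroys the partition.
	echo member > "$1/cpuset.cpus.partition"
}
```

After make_isolated_partition, reading cpuset.cpus.partition back is still
worthwhile, because a rejected request is reported there rather than by the
write failing; the selftest in this series reads it back for that reason.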

Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
 Documentation/admin-guide/cgroup-v2.rst | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6efd0095ed995..7c3b048e75cb5 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2721,6 +2721,14 @@ Cpuset Interface Files
 	kernel boot command line option.  If those CPUs are to be put
 	into a partition, they have to be used in an isolated partition.
 
+	When an isolated partition is created or destroyed, the kernel
+	automatically drives runtime updates of the housekeeping masks
+	for kernel-noise types (nohz_full, RCU NOCB, managed
+	interrupts).  This extends isolation beyond scheduler domains:
+	the tick is stopped on isolated CPUs, RCU callbacks are
+	offloaded to housekeeping cores, and managed interrupts are
+	migrated away.  No additional cgroupfs files are required.
+
 
 Device controller
 -----------------

-- 
2.43.0


^ permalink raw reply related

* [PATCH v3 11/13] cgroup/cpuset: Extend isolated partition to trigger kernel-noise isolation
From: Jing Wu @ 2026-06-18  3:11 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Thomas Gleixner
  Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
	Qiliang Yuan
In-Reply-To: <20260618-wujing-dhm-v3-0-28f1a4d83b68@gmail.com>

When a cpuset isolated partition is created or destroyed, also drive
kernel-noise housekeeping types (HK_TYPE_KERNEL_NOISE and
HK_TYPE_MANAGED_IRQ) through housekeeping_update_types().  The sched
domain mask (HK_TYPE_DOMAIN) is updated first via the existing
housekeeping_update() call, then the explicit callback chain in
housekeeping_update_types() invokes subsystem apply() handlers to
toggle nohz_full, managed IRQ migration, and RCU NOCB offloading.

The update runs outside cpuset_mutex and cpus_read_lock, protected
only by cpuset_top_mutex.

Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
 kernel/cgroup/cpuset.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 5c33ab20cc208..67b93bd4d58f2 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1347,17 +1347,30 @@ static void cpuset_update_sd_hk_unlock(void)
 		rebuild_sched_domains_locked();
 
 	if (update_housekeeping) {
+		static const unsigned long noise_types =
+			BIT(HK_TYPE_KERNEL_NOISE) | BIT(HK_TYPE_MANAGED_IRQ);
+
 		update_housekeeping = false;
 		cpumask_copy(isolated_hk_cpus, isolated_cpus);
 
-		/*
-		 * housekeeping_update() is now called without holding
-		 * cpus_read_lock and cpuset_mutex. Only cpuset_top_mutex
-		 * is still being held for mutual exclusion.
-		 */
 		mutex_unlock(&cpuset_mutex);
 		cpus_read_unlock();
+
+		/*
+		 * Update the sched domain mask first; it must succeed
+		 * before the kernel-noise types because workqueue flush
+		 * and timer migration depend on the sched domain mask.
+		 */
 		WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus));
+
+		/*
+		 * Drive kernel-noise types through the new explicit
+		 * callback chain.  Tick/rcu/genirq subtypes react
+		 * through their registered housekeeping_cbs apply()
+		 * handlers.
+		 */
+		WARN_ON_ONCE(housekeeping_update_types(noise_types,
+						       isolated_hk_cpus));
 		mutex_unlock(&cpuset_top_mutex);
 	} else {
 		cpuset_full_unlock();

-- 
2.43.0


^ permalink raw reply related

* [PATCH v3 10/13] sched: Guard sched_tick_start/stop against uninitialized tick_work_cpu
From: Jing Wu @ 2026-06-18  3:11 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Thomas Gleixner
  Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
	Qiliang Yuan
In-Reply-To: <20260618-wujing-dhm-v3-0-28f1a4d83b68@gmail.com>

sched_tick_start() and sched_tick_stop() are called during CPU hotplug
for CPUs not in the HK_TYPE_KERNEL_NOISE set.  They dereference
tick_work_cpu, which is allocated by sched_tick_offload_init(); that
function is only called from housekeeping_init() when nohz_full= is
present at boot.

When the DHM subsystem first-enables HK_TYPE_KERNEL_NOISE at runtime via
housekeeping_update_types(), tick_work_cpu remains NULL because
sched_tick_offload_init() is __init-only and cannot be re-invoked.  A
subsequent CPU offline/online cycle for an isolated CPU triggers
WARN_ON_ONCE(!tick_work_cpu) followed by a NULL-pointer dereference in
per_cpu_ptr(tick_work_cpu, cpu), crashing the kernel.

Since nohz_full= was not active at boot, tick_nohz_full_running remains
false and the tick-offload infrastructure is never activated; isolated
CPUs continue to receive their own ticks.  Guard both helpers with an
additional !tick_work_cpu check so they become no-ops in this case.

Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
 kernel/sched/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 371b509d92164..df004e3efca70 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5778,7 +5778,7 @@ static void sched_tick_start(int cpu)
 	int os;
 	struct tick_work *twork;

-	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
+	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) || !tick_work_cpu)
 		return;

 	WARN_ON_ONCE(!tick_work_cpu);
@@ -5799,7 +5799,7 @@ static void sched_tick_stop(int cpu)
 	struct tick_work *twork;
 	int os;

-	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
+	if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) || !tick_work_cpu)
 		return;

 	WARN_ON_ONCE(!tick_work_cpu);

-- 
2.43.0

^ permalink raw reply related

* [PATCH v3 09/13] watchdog/lockup_detector: Register housekeeping callback for kernel-noise
From: Jing Wu @ 2026-06-18  3:11 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Thomas Gleixner
  Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
	Qiliang Yuan
In-Reply-To: <20260618-wujing-dhm-v3-0-28f1a4d83b68@gmail.com>

Initialize watchdog_cpumask from HK_TYPE_KERNEL_NOISE rather than
HK_TYPE_TIMER at boot, so the initial mask already reflects any CPUs
excluded by nohz_full= on the kernel command line.

Register a housekeeping_cbs so watchdog_cpumask stays in sync with
HK_TYPE_KERNEL_NOISE when isolation boundaries change at runtime via
cpuset isolated partitions.  The apply() callback copies the new
housekeeping mask into watchdog_cpumask and triggers
__lockup_detector_reconfigure() to restart watchdog threads on the
updated CPU set.

When nohz_full= is absent at boot, tick_nohz_full_running remains
false and DHM isolated partitions do not activate tick suppression.
In that case watchdog_hk_apply() is a no-op: there is no need to
reconfigure the watchdog CPU set because the full nohz_full
infrastructure was never initialized.

Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
 kernel/watchdog.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 55 insertions(+), 1 deletion(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 87dd5e0f6968d..998ad94da4cb9 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -1389,7 +1389,7 @@ void __init lockup_detector_init(void)
 		pr_info("Disabling watchdog on nohz_full cores by default\n");
 
 	cpumask_copy(&watchdog_cpumask,
-		     housekeeping_cpumask(HK_TYPE_TIMER));
+		     housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
 
 	if (!watchdog_hardlockup_probe())
 		watchdog_hardlockup_available = true;
@@ -1398,3 +1398,57 @@ void __init lockup_detector_init(void)
 
 	lockup_detector_setup();
 }
+
+/*
+ * Watchdog housekeeping callback: resync watchdog_cpumask with
+ * HK_TYPE_KERNEL_NOISE when isolation boundaries change at runtime.
+ */
+#ifdef CONFIG_CPU_ISOLATION
+static void watchdog_hk_apply(enum hk_type type)
+{
+	const struct cpumask *hk;
+
+	/*
+	 * When nohz_full= was not given at boot, tick_nohz_full_running
+	 * remains false and the full nohz_full infrastructure was never
+	 * initialised.  DHM isolated partitions do not activate tick
+	 * suppression in that case, so there is no need to reconfigure the
+	 * watchdog CPU set.
+	 */
+#ifdef CONFIG_NO_HZ_FULL
+	if (!READ_ONCE(tick_nohz_full_running))
+		return;
+#endif
+
+	hk = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
+	if (mutex_trylock(&watchdog_mutex)) {
+		cpumask_copy(&watchdog_cpumask, hk);
+		__lockup_detector_reconfigure(false);
+		mutex_unlock(&watchdog_mutex);
+	}
+}
+
+static int watchdog_hk_validate(enum hk_type type,
+				const struct cpumask *cur_mask,
+				const struct cpumask *new_mask)
+{
+	return 0;
+}
+
+static struct housekeeping_cbs watchdog_hk_cbs = {
+	.name		= "watchdog",
+	.pre_validate	= watchdog_hk_validate,
+	.apply		= watchdog_hk_apply,
+};
+
+static int __init watchdog_hk_init(void)
+{
+	int ret;
+
+	ret = housekeeping_register_cbs(HK_TYPE_KERNEL_NOISE, &watchdog_hk_cbs);
+	if (ret)
+		pr_debug("watchdog: hk callback registration skipped (%d)\n", ret);
+	return 0;
+}
+late_initcall(watchdog_hk_init);
+#endif

-- 
2.43.0


^ permalink raw reply related

* [PATCH v3 08/13] genirq: Add explicit housekeeping callback for managed IRQ migration
From: Jing Wu @ 2026-06-18  3:11 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Tejun Heo, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Thomas Gleixner
  Cc: linux-kernel, rcu, cgroups, linux-doc, linux-kselftest, Jing Wu,
	Qiliang Yuan
In-Reply-To: <20260618-wujing-dhm-v3-0-28f1a4d83b68@gmail.com>

Register a housekeeping callback for HK_TYPE_MANAGED_IRQ.  When the
mask changes, iterate all active managed interrupts, intersect their
current affinity mask with the new housekeeping mask, and re-apply
with irq_do_set_affinity().  Managed interrupts on CPUs removed from
the housekeeping set are migrated to remaining housekeeping CPUs.

Only managed interrupts (IRQD_AFFINITY_MANAGED, as tested by
irqd_affinity_is_managed()) are selected because the kernel owns their
affinity; user-controlled IRQ affinities must not be overridden by the
housekeeping layer.

The new HK_TYPE_MANAGED_IRQ cpumask is snapshotted once under an RCU
read lock before the IRQ loop, satisfying the lockdep annotation in
housekeeping_cpumask() for runtime-mutable types.

When the intersection of the IRQ's current affinity and the new
housekeeping mask is non-empty, irq_do_set_affinity() moves the IRQ
to the restricted set.  If the intersection is empty (all CPUs that
were serving this IRQ are now isolated), the affinity update is skipped
and the IRQ continues to run on the isolated CPU temporarily.  Full
support for the IRQ shutdown / re-startup path (when all serving CPUs
become isolated) is left for follow-up work.
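
The decision described above can be modelled with plain integer bitmasks.
This is a toy sketch, not kernel code: the function name is hypothetical,
bit N stands for CPU N, and the extra "online" argument mirrors the
cpumask_intersects(tmp, cpu_online_mask) test in the diff below.

```shell
# Hypothetical model: $1 = current affinity, $2 = new housekeeping mask,
# $3 = online CPUs.  Prints the affinity the IRQ ends up with.
new_irq_affinity()
{
	restricted=$(( $1 & $2 ))
	if [ $(( restricted & $3 )) -ne 0 ]; then
		# Some housekeeping CPU can still serve the IRQ: migrate.
		echo "$restricted"
	else
		# Every serving CPU is now isolated: leave the IRQ alone.
		echo "$1"
	fi
}

# IRQ on CPUs 4-7 (240), housekeeping CPUs 0-5 (63), CPUs 0-7 online (255).
new_irq_affinity 240 63 255
```

In the example call the restricted set is CPUs 4-5; with a housekeeping
mask of CPUs 0-3 (15) the intersection is empty and the original affinity
is kept, which is the case deferred to follow-up work.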

The IRQ loop is guarded by irq_lock_sparse() and the per-descriptor
raw spinlock to prevent races with concurrent affinity changes.

Signed-off-by: Jing Wu <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
 kernel/irq/manage.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 2e80724378267..ea97f455eab2a 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -2801,3 +2801,89 @@ bool irq_check_status_bit(unsigned int irq, unsigned int bitmask)
 	return res;
 }
 EXPORT_SYMBOL_GPL(irq_check_status_bit);
+
+/*
+ * Managed IRQ housekeeping callback: iterate all managed IRQs and ask
+ * the chip to move them off CPUs newly removed from HK_TYPE_MANAGED_IRQ.
+ */
+static void irq_hk_apply(enum hk_type type)
+{
+	cpumask_var_t hk_mask;
+	struct irq_desc *desc;
+	unsigned int irq;
+
+	if (!alloc_cpumask_var(&hk_mask, GFP_KERNEL))
+		return;
+
+	/*
+	 * Snapshot the new HK_TYPE_MANAGED_IRQ mask under an RCU read lock
+	 * before iterating IRQ descriptors.  The lockdep annotation in
+	 * housekeeping_cpumask() requires an RCU read-side critical section
+	 * for runtime-mutable types.
+	 */
+	rcu_read_lock();
+	cpumask_copy(hk_mask, housekeeping_cpumask_rcu(HK_TYPE_MANAGED_IRQ));
+	rcu_read_unlock();
+
+	irq_lock_sparse();
+
+	for_each_active_irq(irq) {
+		desc = irq_to_desc(irq);
+		if (!desc || !desc->action)
+			continue;
+
+		/*
+		 * Only managed interrupts are selected: they have
+		 * IRQD_AFFINITY_MANAGED set, meaning the kernel owns their
+		 * affinity.  User-controlled IRQs are intentionally skipped.
+		 *
+		 * When the intersection of the current affinity mask and the
+		 * new housekeeping mask is non-empty, re-apply the restricted
+		 * affinity to migrate the IRQ away from newly isolated CPUs.
+		 * If the intersection is empty (all serving CPUs are now
+		 * isolated), the IRQ is left on its current CPU temporarily;
+		 * handling that case (IRQ shutdown / re-startup) is left for
+		 * a follow-up.
+		 */
+		if (irqd_affinity_is_managed(&desc->irq_data)) {
+			const struct cpumask *mask;
+			struct cpumask *tmp = this_cpu_ptr(&__tmp_mask);
+
+			raw_spin_lock_irq(&desc->lock);
+			mask = irq_data_get_affinity_mask(&desc->irq_data);
+			cpumask_and(tmp, mask, hk_mask);
+			if (cpumask_intersects(tmp, cpu_online_mask))
+				irq_do_set_affinity(&desc->irq_data, tmp, false);
+			raw_spin_unlock_irq(&desc->lock);
+		}
+	}
+
+	irq_unlock_sparse();
+	free_cpumask_var(hk_mask);
+}
+
+static int irq_hk_validate(enum hk_type type,
+			   const struct cpumask *cur_mask,
+			   const struct cpumask *new_mask)
+{
+	if (!IS_ENABLED(CONFIG_SMP))
+		return -EOPNOTSUPP;
+	return 0;
+}
+
+static struct housekeeping_cbs irq_hk_cbs = {
+	.name		= "genirq/managed",
+	.pre_validate	= irq_hk_validate,
+	.apply		= irq_hk_apply,
+};
+
+static int __init irq_hk_init(void)
+{
+	int ret;
+
+	ret = housekeeping_register_cbs(HK_TYPE_MANAGED_IRQ, &irq_hk_cbs);
+	if (ret)
+		pr_info("genirq: managed IRQ runtime migration disabled (%d)\n", ret);
+	return 0;
+}
+late_initcall(irq_hk_init);

-- 
2.43.0


^ permalink raw reply related
