Linux cgroups development
* Re: [PATCH v2 07/16] mm/slab: replace struct partial_context with slab_alloc_context
From: Harry Yoo @ 2026-06-11  6:05 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260610-slab_alloc_flags-v2-7-7190909db118@kernel.org>


On 6/11/26 12:40 AM, Vlastimil Babka (SUSE) wrote:
> Refactor get_from_partial_node(), get_from_any_partial(),
> get_from_partial() and ___slab_alloc().
> 
> Remove struct partial_context, which used to be more substantial but
> shrank as part of the sheaves conversion. Instead pass gfp_flags and
> pointer to the new slab_alloc_context, which together is a superset of
> partial_context.
> 
> This means alloc_flags are now available and we can use them to
> determine if spinning is allowed, further reducing false positive "not
> allowed" in the slow path due to gfp flags lacking __GFP_RECLAIM.
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Looks good to me,
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>

-- 
Cheers,
Harry / Hyeonggon


* Re: [PATCH 1/1] cgroup: rdma: free idle pools during cgroup teardown
From: Tao Cui @ 2026-06-11  5:25 UTC (permalink / raw)
  To: Ren Wei, cgroups
  Cc: tj, hannes, mkoutny, pandit.parav, yuantan098, zcliangcn, bird,
	tr0jan, d4n.for.sec
In-Reply-To: <9eb365a37ab83f38686007f8a61a656759d39bd7.1781092143.git.d4n.for.sec@gmail.com>

Hi,

On 2026/6/11 02:13, Ren Wei wrote:
> From: Daming Li <d4n.for.sec@gmail.com>
> 
> rdmacg_css_offline() converts each pool to all-max limits so the
> existing reclaim path can free it after the last uncharge. However,
> zero-usage pools are already reclaimable at that point and leaving them
> linked until rdmacg_css_free() lets later device teardown hit a
> use-after-free when free_cg_rpool_locked() deletes cg_node from a freed
> cgroup list head.
> 
> Free zero-usage pools directly from rdmacg_css_offline() while holding
> rdmacg_mutex. This keeps the existing reclaim rule, avoids new lifetime
> states, and ensures a cgroup cannot be freed with reclaimable rdmacg
> pools still attached.
Looks good to me.

One minor note: the offline path skips rpool_has_persistent_state()
and frees idle pools unconditionally. This means peak/event stats are
lost earlier than before (at offline vs. at free). This is fine given
the cgroup is dying, and css_free() cleans up remaining pools anyway.

Reviewed-by: Tao Cui <cuitao@kylinos.cn>

Thanks,
--
Tao
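
The teardown rule reviewed here can be modelled in user space. This is a minimal sketch, assuming a simplified singly linked pool list; `struct pool`, `struct cg`, `cg_add_pool()`, `cg_offline()` and `cg_count_pools()` are hypothetical stand-ins and not the rdma cgroup code. It only illustrates the idea of unlinking and freeing zero-usage pools at offline time so that none stay attached to a dying cgroup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one per-device resource pool of a cgroup. */
struct pool {
	struct pool *next;
	unsigned int usage;	/* number of outstanding charges */
};

/* Hypothetical stand-in for the cgroup that owns the pool list. */
struct cg {
	struct pool *pools;
};

/* Link a new pool with the given usage; returns false if malloc() fails. */
bool cg_add_pool(struct cg *cg, unsigned int usage)
{
	struct pool *p = malloc(sizeof(*p));

	if (!p)
		return false;
	p->usage = usage;
	p->next = cg->pools;
	cg->pools = p;
	return true;
}

/*
 * Model of the offline step: unlink and free every pool with zero usage.
 * Pools that are still charged stay linked until their last uncharge.
 * Returns the number of pools freed.
 */
size_t cg_offline(struct cg *cg)
{
	struct pool **link;
	size_t freed = 0;

	assert(cg != NULL);
	link = &cg->pools;
	while (*link) {
		struct pool *p = *link;

		if (p->usage == 0) {
			*link = p->next;	/* unlink before freeing */
			free(p);
			freed++;
		} else {
			link = &p->next;
		}
	}
	return freed;
}

/* Count the pools still linked to the cgroup. */
size_t cg_count_pools(const struct cg *cg)
{
	size_t n = 0;

	for (const struct pool *p = cg->pools; p; p = p->next)
		n++;
	return n;
}
```

In this model, a cgroup with two idle pools and one charged pool ends up with only the charged pool linked after `cg_offline()`.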


* Re: [PATCH v2 06/16] mm/slab: add alloc_flags to slab_alloc_context
From: Harry Yoo @ 2026-06-11  5:06 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260610-slab_alloc_flags-v2-6-7190909db118@kernel.org>


On 6/11/26 12:40 AM, Vlastimil Babka (SUSE) wrote:
> Add alloc_flags as a new field to the slab_alloc_context helper struct,
> so we can pass it to more functions in the slab implementation without
> adding another function parameter.
> 
> Start checking them via alloc_flags_allow_spinning() in
> alloc_single_from_new_slab() (where we can drop the allow_spin
> parameter) and ___slab_alloc(). This further reduces false-positive
> spinning-not-allowed from allocations that are not kmalloc_nolock() but
> lack __GFP_RECLAIM flags.
> 
> _kmalloc_nolock_noprof() initializes ac.alloc_flags using its flags that
> are SLAB_ALLOC_TRYLOCK. slab_alloc_node() and __kmem_cache_alloc_bulk()
> are not reachable from kmalloc_nolock() and all their callers expect
> spinning to be allowed, so they can use SLAB_ALLOC_DEFAULT. This is
> temporary as the scope of slab_alloc_context will further move to the
> callers, making the alloc_flags usage more obvious.
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>

-- 
Cheers,
Harry / Hyeonggon


* Re: [PATCH v2 05/16] mm/slab: introduce alloc_flags and SLAB_ALLOC_TRYLOCK
From: Harry Yoo @ 2026-06-11  4:57 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260610-slab_alloc_flags-v2-5-7190909db118@kernel.org>


On 6/11/26 12:40 AM, Vlastimil Babka (SUSE) wrote:
> Similarly to the page allocators, introduce slab-allocator specific
> alloc flags that internally control allocation behavior in addition to
> gfp_flags, without occupying the limited gfp flags space.
> 
> Introduce the first flag SLAB_ALLOC_TRYLOCK that behaves similarly to
> page allocator's ALLOC_TRYLOCK and will be used to reimplement
> kmalloc_nolock()'s "!allow_spin" behavior. That currently relies on
> gfpflags_allow_spinning() and thus the lack of both __GFP_RECLAIM flags,
> importantly __GFP_KSWAPD_RECLAIM. This can give false-positive results
> e.g. in early boot with a restricted gfp_allowed_mask.
> 
> Also introduce alloc_flags_allow_spinning() to replace the usage of
> gfpflags_allow_spinning().
> 
> Start using alloc_flags and the new check first in alloc_from_pcs() and
> __pcs_replace_empty_main(). This means some slab allocations that were
> falsely treated as kmalloc_nolock() due to their gfp flags will now have
> higher chances of succeeding, and this will further increase with followup
> changes.
> 
> Remove a WARN_ON_ONCE() from refill_objects() as it's now legitimate to
> reach it from a slab allocation that's not _nolock() and yet lacks
> __GFP_KSWAPD_RECLAIM for other reasons.
>
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>

-- 
Cheers,
Harry / Hyeonggon
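
The false positive described in the commit message can be illustrated with a small user-space model. This is a sketch under stated assumptions: the bit values are made up for illustration, `gfp_allows_spinning()` is a hypothetical stand-in for the gfp-based check, and `alloc_flags_allow_spinning()` here only models the helper named in the patch, not its actual body.

```c
#include <stdbool.h>

/* Illustrative bit values only; the real kernel values differ. */
#define GFP_DIRECT_RECLAIM	0x1u
#define GFP_KSWAPD_RECLAIM	0x2u
#define GFP_RECLAIM		(GFP_DIRECT_RECLAIM | GFP_KSWAPD_RECLAIM)

#define SLAB_ALLOC_DEFAULT	0x0u
#define SLAB_ALLOC_TRYLOCK	0x1u

/* gfp-based check: infers "no spinning" from the absence of reclaim bits. */
bool gfp_allows_spinning(unsigned int gfp)
{
	return (gfp & GFP_RECLAIM) != 0;
}

/* alloc_flags-based check: only an explicit trylock request forbids spinning. */
bool alloc_flags_allow_spinning(unsigned int alloc_flags)
{
	return !(alloc_flags & SLAB_ALLOC_TRYLOCK);
}
```

With a gfp mask that has lost its reclaim bits for unrelated reasons, the first check reports that spinning is not allowed, while the second still allows it as long as the caller passed `SLAB_ALLOC_DEFAULT`.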


* Re: [PATCH v2 04/16] mm/slab: introduce slab_alloc_context
From: Harry Yoo @ 2026-06-11  4:49 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260610-slab_alloc_flags-v2-4-7190909db118@kernel.org>


On 6/11/26 12:40 AM, Vlastimil Babka (SUSE) wrote:
> Similarly to page allocator's struct alloc_context, introduce a helper
> struct to hold a part of the allocation arguments. This will allow
> reducing the number of parameters in many functions of the
> implementation, and extend them easily if needed.
> 
> For now, make it hold the caller address and the originally requested
> allocation size.
> 
> Convert alloc_single_from_new_slab(), __slab_alloc_node() and
> ___slab_alloc(). No functional change intended.
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>

-- 
Cheers,
Harry / Hyeonggon
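
As a rough illustration of the helper-struct idea, here is a user-space sketch. The struct name and `orig_size` come from the thread; the `caller` field name and the `slack_bytes()` helper are assumptions made up for this example.

```c
#include <assert.h>

/* Model of the context struct: groups arguments that travel together. */
struct slab_alloc_context {
	unsigned long caller;	/* address of the allocation call site */
	unsigned int orig_size;	/* size originally requested by the caller */
};

/*
 * Hypothetical helper that takes the context instead of separate
 * parameters. Adding a field to the struct later does not change this
 * signature or any of its callers.
 */
unsigned int slack_bytes(unsigned int object_size,
			 const struct slab_alloc_context *ac)
{
	assert(ac->orig_size <= object_size);
	return object_size - ac->orig_size;
}
```

A caller fills the struct once and hands the same pointer down the call chain, which is what keeps the parameter lists short.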


* Re: [PATCH v2 01/16] mm/slab: do not limit zeroing to orig_size when only red zoning is enabled
From: Harry Yoo @ 2026-06-11  4:28 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, stable
In-Reply-To: <20260610-slab_alloc_flags-v2-1-7190909db118@kernel.org>


On 6/11/26 12:40 AM, Vlastimil Babka (SUSE) wrote:
> When init (zeroing) on allocation is requested, for kmalloc() we
> generally have to zero the full object size even if a smaller size is
> requested, in order to provide krealloc()'s __GFP_ZERO guarantees.
> 
> But if we track the requested size, krealloc() uses that information to
> do the right thing. With red zoning also enabled, any unused size
> became part of the red zone, so it must not be zeroed.
> 
> However the check is imprecise, and will trigger also when only
> SLAB_RED_ZONE is enabled without SLAB_STORE_USER. This means enabling
> red zoning alone can compromise krealloc()'s __GFP_ZERO contract.
> 
> Fix this by using slub_debug_orig_size() instead, which is the exact
> check for whether the requested size is tracked. We don't need to care
> if red zoning is also enabled or not. Also update and expand the
> comment accordingly.
> 
> Fixes: 9ce67395f5a0 ("mm/slub: only zero requested size of buffer for kzalloc when debug enabled")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>

-- 
Cheers,
Harry / Hyeonggon
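
The zeroing rule from the commit message can be sketched in user space as follows. The boolean `orig_size_tracked` stands in for the result of slub_debug_orig_size(); the function names `zero_size()` and `init_object()` are hypothetical and exist only in this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Pick how many bytes to zero. When the requested size is tracked, only
 * that many bytes are zeroed, because the rest of the object may belong
 * to debugging metadata. Otherwise the full object is zeroed, so a later
 * in-place grow still sees zeroes past the originally requested size.
 */
size_t zero_size(size_t object_size, size_t orig_size, bool orig_size_tracked)
{
	assert(orig_size <= object_size);
	return orig_size_tracked ? orig_size : object_size;
}

/* Zero a freshly allocated object according to the rule above. */
void init_object(unsigned char *obj, size_t object_size, size_t orig_size,
		 bool orig_size_tracked)
{
	memset(obj, 0, zero_size(object_size, orig_size, orig_size_tracked));
}
```

The point of the fix is that the decision depends only on whether the requested size is tracked, not on whether red zoning happens to be enabled.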


* Re: [PATCH v2 02/16] mm/slab: do not init any kfence objects on allocation
From: Harry Yoo @ 2026-06-11  3:19 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Andrey Konovalov, Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm,
	linux-kernel, cgroups
In-Reply-To: <20260610-slab_alloc_flags-v2-2-7190909db118@kernel.org>


On 6/11/26 12:40 AM, Vlastimil Babka (SUSE) wrote:
> When init (zeroing) on allocation is requested, for kmalloc() we
> generally have to zero the full object size even if a smaller size is
> requested, in order to provide krealloc()'s __GFP_ZERO guarantees.

Oh, today I learned...

> When we end up allocating a kfence object, kfence performs the zeroing on
> its own because it has its own redzone beyond the requested size. Thus
> slab_post_alloc_hook() has an 'init' parameter which has to be evaluated
> in all callers (via slab_want_init_on_alloc()) and should be false for
> kfence allocations.

TIL again :D

> For kfence allocations in slab_alloc_node() this is achieved by subtly
> skipping over the slab_want_init_on_alloc() call.

Indeed subtle and I didn't realize this.

> Other callers (i.e.
> kmem_cache_alloc_bulk_noprof()) however evaluate it unconditionally even
> if they do end up with a kfence allocation. This is only subtly not a
> problem, as those are not kmalloc allocations and thus the "requested
> size" equals s->object_size and thus it cannot interfere with kfence's
> redzone.

Right.

> There's just an unnecessary double zeroing (in both kfence and
> slab_post_alloc_hook()), but it's all very fragile and contradicts the
> comment in kfence_guarded_alloc().

Right.

> Remove this subtlety and simplify the code by eliminating the init
> parameter from slab_post_alloc_hook() and make it call
> slab_want_init_on_alloc() itself. Instead add a is_kfence_address()
> check before performing the memset, which will start doing the right
> thing for all callers of slab_post_alloc_hook().

Great, more straightforward!

> This potentially adds overhead of the is_kfence_address() check to
> allocation hotpath, but that one is designed to be as small as possible,
> and it's only evaluated if zeroing is about to happen. This means (aside
> from init_on_alloc hardening) only for __GFP_ZERO allocations, and the
> zeroing itself comes with an overhead likely larger than the added
> check.
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
>  mm/kfence/core.c |  2 +-
>  mm/slub.c        | 23 ++++++++---------------
>  2 files changed, 9 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index e2ee8f1aaccf..8e5264d3ddbf 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4565,9 +4565,10 @@ struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
>  
>  static __fastpath_inline
>  bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> -			  gfp_t flags, size_t size, void **p, bool init,
> +			  gfp_t flags, size_t size, void **p,
>  			  unsigned int orig_size)
>  {
> +	bool init = slab_want_init_on_alloc(flags, s);
>  	unsigned int zero_size = s->object_size;
>  	bool kasan_init = init;
>  	size_t i;
> @@ -4608,7 +4609,8 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>  	for (i = 0; i < size; i++) {
>  		p[i] = kasan_slab_alloc(s, p[i], init_flags, kasan_init);
>  		if (p[i] && init && (!kasan_init ||
> -				     !kasan_has_integrated_init()))
> +				     !kasan_has_integrated_init())
> +				 && !is_kfence_address(p[i]))

I hope we could make it a bit more verbose and straightforward,
something like:

diff --git a/mm/slub.c b/mm/slub.c
index 5d7ea72ebebd..29cf4590f9d9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4573,7 +4573,6 @@ bool slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags, size_t size,
 {
 	bool init = slab_want_init_on_alloc(flags, s);
 	unsigned int zero_size = s->object_size;
-	bool kasan_init = init;
 	size_t i;
 	gfp_t init_flags = flags & gfp_allowed_mask;

@@ -4591,29 +4590,37 @@ bool slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags, size_t size,
 	if (slub_debug_orig_size(s))
 		zero_size = ac->orig_size;

-	/*
-	 * When slab_debug is enabled, avoid memory initialization integrated
-	 * into KASAN and instead zero out the memory via the memset below with
-	 * the proper size. Otherwise, KASAN might overwrite SLUB redzones and
-	 * cause false-positive reports. This does not lead to a performance
-	 * penalty on production builds, as slab_debug is not intended to be
-	 * enabled there.
-	 */
-	if (__slub_debug_enabled())
-		kasan_init = false;
-
-	/*
-	 * As memory initialization might be integrated into KASAN,
-	 * kasan_slab_alloc and initialization memset must be
-	 * kept together to avoid discrepancies in behavior.
-	 *
-	 * As p[i] might get tagged, memset and kmemleak hook come after KASAN.
-	 */
 	for (i = 0; i < size; i++) {
-		p[i] = kasan_slab_alloc(s, p[i], init_flags, kasan_init);
-		if (p[i] && init && (!kasan_init ||
-				     !kasan_has_integrated_init())
-				 && !is_kfence_address(p[i]))
+		bool skip_init = false;
+
+		if (is_kfence_address(p[i])) {
+			/*
+			 * kfence zeroes the object instead of SLUB to avoid
+			 * overwriting its own redzone, and zeroing of
+			 * s->object_size will corrupt it.
+			 */
+			skip_init = true;
+		} else if (__slub_debug_enabled()) {
+			/*
+			 * KASAN never zeroes memory when slab_debug is enabled
+			 * to avoid overwriting SLUB redzones. This does not
+			 * lead to a performance penalty on production builds,
+			 * as slab_debug is not intended to be enabled there.
+			 */
+			skip_init = false;
+		} else if (kasan_has_integrated_init()) {
+			/*
+			 * ARM64 can set memory tags and zero the memory using
+			 * a single instruction. Since HW_TAGS KASAN uses that
+			 * while tagging the object, a separate zeroing is
+			 * unnecessary unless slab_debug is enabled.
+			 */
+			skip_init = true;
+		}
+
+		p[i] = kasan_slab_alloc(s, p[i], init_flags, init && skip_init);
+		/* memset and hooks come after KASAN as p[i] might get tagged */
+		if (p[i] && init && !skip_init)
 			memset(p[i], 0, zero_size);
 		if (alloc_flags_allow_spinning(ac->alloc_flags))
 			kmemleak_alloc_recursive(p[i], s->object_size, 1,
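
The decision chain in the suggested diff above can be checked in isolation with a small user-space model. This is only a sketch: the three boolean inputs stand in for is_kfence_address(), __slub_debug_enabled() and kasan_has_integrated_init(), and the function names `skip_slub_init()` and `slub_memsets()` are hypothetical.

```c
#include <stdbool.h>

/*
 * Returns true when SLUB should skip its own memset for an object.
 * The order of the checks follows the if / else-if chain in the diff.
 */
bool skip_slub_init(bool is_kfence, bool slub_debug, bool kasan_integrated_init)
{
	if (is_kfence)
		return true;	/* kfence zeroes the object itself */
	if (slub_debug)
		return false;	/* SLUB zeroes with the proper size */
	if (kasan_integrated_init)
		return true;	/* zeroing happens while tagging */
	return false;
}

/* Whether the memset in the loop runs for this object. */
bool slub_memsets(bool init, bool is_kfence, bool slub_debug,
		  bool kasan_integrated_init)
{
	return init && !skip_slub_init(is_kfence, slub_debug,
				       kasan_integrated_init);
}
```

Writing the chain as a pure function makes the priority explicit: the kfence case wins over everything, and slab_debug wins over integrated init.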



* Re: [PATCH] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed
From: Waiman Long @ 2026-06-11  2:50 UTC (permalink / raw)
  To: Farhad Alemi, Andrew Morton, David Hildenbrand, Gregory Price
  Cc: Farhad Alemi, Yury Norov, Joshua Hahn, Zi Yan, Matthew Brost,
	Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
	Rasmus Villemoes, linux-mm, linux-kernel, cgroups, stable
In-Reply-To: <CA+0ovCg05rUk1-3k2ysdxmbcER8aG-wVh9SSTrrbp6LPWpPHYA@mail.gmail.com>

On 6/9/26 7:57 PM, Farhad Alemi wrote:
> cpuset_update_tasks_nodemask() rebinds a task's own mempolicy to the
> cpuset's effective, online mems (newmems, from guarantee_online_mems()),
> but rebinds that task's VMA mempolicies to the *configured* mask instead:
>
> 	cpuset_change_task_nodemask(task, &newmems);
> 	...
> 	mpol_rebind_mm(mm, &cs->mems_allowed);
>
> On the default (v2) hierarchy a cpuset that has never had cpuset.mems
> written keeps mems_allowed empty while effective_mems is inherited
> non-empty from the parent, and tasks may be attached to it (the
> empty-mems attach check is v1-only).  A subsequent rebind -- e.g. from a
> CPU hotplug event walking the cpuset -- then calls mpol_rebind_mm() with
> an empty mask.  For a VMA policy created with MPOL_F_RELATIVE_NODES this
> reaches mpol_relative_nodemask() ->
> nodes_fold(..., nodes_weight(cs->mems_allowed) == 0) -> bitmap_fold(),
> whose set_bit(oldbit % sz, dst) divides by zero:
>
>    Oops: divide error: 0000 [#1] SMP KASAN NOPTI
>    RIP: 0010:bitmap_fold+0x5e/0xb0
>     mpol_rebind_nodemask
>     mpol_rebind_mm
>     cpuset_update_tasks_nodemask
>     cpuset_handle_hotplug
>     sched_cpu_deactivate
>     cpuhp_thread_fun
>
> cs->mems_allowed is the only nodemask in this function that is not the
> effective set: the task-policy rebind, the page-migration target and
> cs->old_mems_allowed all use newmems.  The sibling cpuset_attach() path
> already rebinds VMA policies against the effective mems
> (cpuset_attach_nodemask_to = cs->effective_mems) and explicitly notes
> that mems_allowed can be empty under hotplug.  Rebind the VMA policies to
> newmems too: it is guaranteed non-empty by guarantee_online_mems(), which
> fixes the divide-by-zero, and it makes the VMA policies consistent with
> the task policy and with the nodes the task is actually allowed to use.
>
> Fixes: ae1c802382f7 ("cpuset: apply cs->effective_{cpus,mems}")
> Suggested-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Farhad Alemi <farhad.alemi@berkeley.edu>
> Cc: stable@vger.kernel.org
> ---
>   kernel/cgroup/cpuset.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -2649,7 +2649,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
>
>   		migrate = is_memory_migrate(cs);
>
> -		mpol_rebind_mm(mm, &cs->mems_allowed);
> +		mpol_rebind_mm(mm, &newmems);
>   		if (migrate)
>   			cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
>   		else

Could you change it to &cs->effective_mems instead? For v2,
effective_mems will never be empty.

In fact, this is part of the following patch

https://lore.kernel.org/lkml/20260604150229.414135-2-longman@redhat.com/

Given that this bug can crash the kernel, it should be split out into
its own patch.

Cheers,
Longman
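
The failure mode in the report comes down to a modulo by the weight of an empty mask. A minimal user-space sketch of the folding step, assuming a single 64-bit word instead of a kernel bitmap; `fold_mask()` is a hypothetical name, and the `sz == 0` guard exists only in this sketch (the patch instead avoids passing an empty mask):

```c
#include <stdint.h>

/*
 * Fold a 64-bit mask into its low `sz` bits: bit b of `orig` sets bit
 * (b % sz) of the result. With sz == 0 the modulo would divide by zero,
 * so this sketch returns an empty mask in that case.
 */
uint64_t fold_mask(uint64_t orig, unsigned int sz)
{
	uint64_t dst = 0;

	if (sz == 0)
		return 0;	/* guard that exists only in this sketch */

	for (unsigned int bit = 0; bit < 64; bit++) {
		if (orig & (UINT64_C(1) << bit))
			dst |= UINT64_C(1) << (bit % sz);
	}
	return dst;
}
```

Passing a mask that is guaranteed non-empty gives a non-zero `sz`, which is why rebinding against the effective mems avoids the divide error.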




* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Balbir Singh @ 2026-06-10 23:53 UTC (permalink / raw)
  To: Gregory Price
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <aik_ddHymus2DJ6D@gourry-fedora-PF4VCD3F>

On Wed, Jun 10, 2026 at 06:41:57AM -0400, Gregory Price wrote:
> On Wed, Jun 03, 2026 at 03:00:01PM +1000, Balbir Singh wrote:
> > > 
> > >    __GFP_THISNODE cannot be overloaded to do anything useful here.
> > 
> > Let me clarify, I meant to say, let's use a nodemask for allocation
> > and __GFP_THISNODE gets us to the node we desire, if that is the only
> > node. My earlier comment might not have been clear.
> > 
> 
> I've been testing a stripped-back patch set where I drop all FALLBACK
> entries for private nodes (including for itself) and only keep the
> NOFALLBACK entry for private nodes.
> 
> This effectively isolates the nodes for any allocation without
> __GFP_THISNODE.
> 
> This also precludes these nodes from ever using non-mbind mempolicies,
> which I think is a completely reasonable compromise and something I was
> already expecting we would do.
> 
> Notably: slub.c injects __GFP_THISNODE internally on behalf of kmalloc,
> which causes spillage into private nodes because slub allows private
> nodes in its mask.  I think this is fixable.
> 

Agreed.

> I have to inspect some other __GFP_THISNODE users (hugetlb, some arch
> code, etc), but it seems like fully dropping the FALLBACK entries and
> requiring __GFP_THISNODE might be sufficient.
>
> ~Gregory

That's good progress, thanks for the update!

Balbir


* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Balbir Singh @ 2026-06-10 23:09 UTC (permalink / raw)
  To: Gregory Price
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <aiFtJFqkpbZ9qFvM@gourry-fedora-PF4VCD3F>

On Thu, Jun 04, 2026 at 01:18:44PM +0100, Gregory Price wrote:
> On Thu, Jun 04, 2026 at 08:35:19PM +1000, Balbir Singh wrote:
> > 
> > My concern is that __GFP_PRIVATE is too wide. I wonder if we'll need to
> > support N_MEMORY_PRIVATE nodes that are not all homogeneous memory nodes,
> > very similar to how not all ZONE_DEVICE memory is homogeneous.
> >
> 
> Can you be more precise about your definition of homogeneous here?
> 
> Are you saying not all memory on a private node will be homogeneous?
>    While possible, I would argue that you should not do this and
>    should instead prefer to use multiple nodes - 1 per memory class.
> 
> Are you saying not all private nodes will be homogeneous?
>    I don't see the issue with this.

Yes, I meant, nodes might belong to different devices. These might not
want fallover allocations, for example __GFP_PRIVATE falling back to
unwanted nodes.

> 
> > > 
> > > Agreed, but also one which can be deferred and played with since it's
> > > all kernel-internal.  None of this should have UAPI implications, and we
> > > need to accept that we're going to get it wrong on the first try.
> > > 
> > 
> > Agreed that we might get the design wrong, until we fix it up. I feel
> > that __GFP_PRIVATE should be an evolution of the design to that point.
> >
> 
> Possibly.  If we can't guarantee isolation without __GFP_PRIVATE, then
> we probably can't merge the baseline without it.
> 

I'll rethink this, but I am concerned that __GFP_PRIVATE is too
broad, in fact it breaks isolation by allocating from any private
device. Again this is a function of how fallback lists are organized.

> > > Because pagecache pages are associated with potentially many VMAs.
> > > 
> > > The fault can be a soft fault or a hard fault.  On soft fault - the page
> > > was already present, and will simply fault into VMA without being
> > > migrated.
> > > 
> > 
> > Let's split this into two:
> > 
> > 1. unmapped page cache is never impacted by mempolicy and should not
> >    end up on private memory nodes
> > 2. For shared pages, mempolicy would be hard, but it would need to
> >    be on a set of nodes backed by private memory, depending on mbind()
> >    policy
> >
> ... snip ...
> > 
> > I'd need to think more about this. For now, my basic requirement would
> > be that unmapped page cache should not come from/to private nodes.
> > 
> 
> This does not fully describe the problem.
> 
> A file can be opened and cached as unmapped page cache, and then mapped
> at a later time - at which point the mapped copy would share the filemap
> page cache page.
> 
> Worse, because it's file-backed, you can have the memory faulted onto
> your remote node - reclaimed - and then faulted back in via the process
> accessing the file via unmapped operations (read/write), at which point
> you've had a silent migration occur.
> 
> Basically consider
> 
> Process A:
>    fd = open("myfile", ..., RO);
>    read(fd, ...);  /* mm/filemap.c fills page cache */
> 
> Process B:
>    fd = open("myfile", ...);
>    mem = mmap(fd, ...);
>    mbind(mem, ..., private_node);
>    for page in mem:
>        int tmp = mem[page]; /* fault into vma */
> 
> The result of Process A running first is Process B thinks it has faulted
> the memory onto private_node, but in reality it's taking soft faults and
> just getting the filemap folio mapped in.
> 
> If you wanted mbind() support from the start, we would have to limit
> applicability to anon memory only.
> 
> Shared anon memory is different, as there is a radix tree that deals
> with a shared mempolicy state.

Ack, need to think through this.

> 
> > 
> > I am open to this, I was coming from the blueprint approach of:
> > - Let's mimic N_MEMORY with N_MEMORY_PRIVATE and then pick and choose
> >   what features to change or make specific to the implementation
> >
> 
> N_MEMORY essentially states:
> 	"This is normal memory touch it however you like"
> 
> N_MEMORY_PRIVATE (_MANAGED, w/e) says
> 	"This is NOT normal memory, there are special rules here"
> 
> So, no, let's not mimic N_MEMORY.  This is a "closed by default" design,
> while N_MEMORY is an "open by default" design.  This design choice is
> explicit to make reasoning about these nodes feasible.
> 
> > > This is informed by a single use case / device.
> > > 
> > > There are users / devices that don't want any UAPI for their memory,
> > > but simply wish to re-utilize some subsection of mm/ (page_alloc,
> > > reclaim, etc).
> > > 
> > 
> > But then, why do they need NUMA nodes? Do we have a list of use cases?
> >
> 
> So far I have collected:
> 
> - Network accelerators carrying their own memory for message buffers
> - GPUs with semi-general-purpose working memory across coherent links
> - Exceptionally slow distributed memory that you do not want fallback
>   allocations to (so you want to deliberately tier what lands there)
> - Compressed memory (just another form of accelerator really) which
>   has *special access rules* (i.e. writes need to be controlled)
> 
> In most if not all of these cases, the right abstraction to reason about
> where memory *should come from* IS a NUMA node.
> 
> - the network stack can be taught to check if the target device has a
>   node with memory and prefer that node over local memory
> 
> - accelerators can be given private nodes to manage memory using
>   core mm/ components, without worrying that general kernel operation
>   will put unrelated memory on those nodes or do things like migrate
>   your pages out from under you (unless your driver/service requested
>   that).
> 
> the tiering application should be somewhat obvious / trivial.
> 
> > > 
> > > I am trying to test whether, lacking __GFP_PRIVATE, any normal runtime
> > > operations still reach private nodes that were removed from fallback
> > > lists, via something like the possible / online nodemask.
> > > 
> > > I remember, maybe a year ago, there were per-node allocations happening
> > > during hotplug and that's why I originally proposed __GFP_PRIVATE, but
> > > I'm trying to re-collect that data now.
> > > 
> > 
> > Thanks, I look forward to the next set of patches. Let me know if I
> > can help test what's on the list or if you want me to wait for the next
> > round
> >
> 
> Really I want to get the minimized set out the door so we can start
> breaking this up by feature (reclaim, mempolicy, etc), because trying to
> reason about it as a whole is infeasible - and I cannot be the single
> arbiter of every use case (I simply do not have sufficient context).
> 
> I'm reworking it all as we speak.
> 

Look forward to it

Balbir


* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-10 22:18 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <d01fb1ed-2418-42ee-aea2-37f9a5c5729c@kernel.org>

On Wed, Jun 10, 2026 at 08:59:59PM +0200, David Hildenbrand (Arm) wrote:
> 
> At LSF/MM we talked about how GFP flags are bad and how deriving stuff from the
> context might be better. I think there was also talk about how the memalloc_*
> interface might be a better way forward. Maybe we would start giving the
> allocator more context ("we are allocating a folio").
> 
> The following is incomplete (esp. hugetlb stuff I assume), just as some idea:
> 

Ok, this was easier to test than I expected, and hugetlb is indeed a
stickler.  We can't get there 100% with just MEMALLOC_FOLIO, we still
need a MEMALLOC_PRIVATE - specifically because of users like hugetlb.

hugetlb uses __GFP_THISNODE to do its allocations, and all hugetlb
allocations are folio allocations - so the code you shared by itself
does not gate hugetlb from spilling into private nodes.

That means we still need something like this in hugetlb:

  if (node_is_private(nid))
      /* fail allocation */

HOWEVER... if you have MEMALLOC_PRIVATE - you make the allocation
failure a *page allocator* problem, and it serves exactly the same
purpose that __GFP_PRIVATE did.

the resulting code is two lines in my anondax driver:

    unsigned int priv_flags = memalloc_private_save();
    ret = do_anonymous_page_node(vmf, dev_dax->target_node);
    memalloc_private_restore(priv_flags);

No special hugetlb, slab, arch code handling - they all just fail
to allocate / fall back.  If they fail - it means that code is using
a bad nodemask and we need to go fix it (exactly what we want!)
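
To make the scoping concrete, here is a minimal userspace sketch of the
save/restore pattern above.  It is a model, not kernel code: task_flags,
MODEL_PF_MEMALLOC_PRIVATE, node_private[] and the model_* helpers are all
made-up names for illustration, and the bookkeeping is only loosely patterned
on the existing memalloc_*_save()/restore() helpers.

```c
/* Userspace model only; every name here is hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PF_MEMALLOC_PRIVATE 0x1u	/* stand-in for a task flag bit */
#define MODEL_NR_NODES 2

static unsigned int task_flags;		/* stands in for current->flags */
static const bool node_private[MODEL_NR_NODES] = { false, true };
static char backing[MODEL_NR_NODES];	/* one fake "page" per node */

/* Enter a scope in which private-node allocations are permitted.
 * Returns the bits that were newly set, so nested scopes compose. */
static unsigned int model_private_save(void)
{
	unsigned int newly_set = ~task_flags & MODEL_PF_MEMALLOC_PRIVATE;

	task_flags |= MODEL_PF_MEMALLOC_PRIVATE;
	return newly_set;
}

/* Leave the scope: clear only the bits this scope set itself. */
static void model_private_restore(unsigned int newly_set)
{
	task_flags &= ~newly_set;
}

/* "Page allocator": a private node is reachable only inside a scope. */
static void *model_alloc_on_node(int nid)
{
	assert(nid >= 0 && nid < MODEL_NR_NODES);

	if (node_private[nid] && !(task_flags & MODEL_PF_MEMALLOC_PRIVATE))
		return NULL;	/* caller did not opt in: fail the request */
	return &backing[nid];
}
```

In this model the failure lives entirely in model_alloc_on_node(), so callers
that never opt in need no private-node awareness of their own.  Nested scopes
work because an inner save returns 0 when the bit is already set, and its
restore then clears nothing.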

I think additionally, we might be able to repurpose MEMALLOC_PRIVATE
flag for Brendan's needs as well [1].

Their goal (IIRC) was to have a pile of unmapped blocks that could
be opportunistically converted to normal memory, but otherwise left
unmapped and sitting in the buddy.

Same thing - different filter point (blocks vs nodes).

If you set MEMALLOC_PRIVATE - it makes private node allocations
possible, and "private block" access (without conversion) possible.

Otherwise private nodes are unreachable, and private blocks would be
treated like CMA (last-resort stealing, lazy-direct-mapping).

And they stack (private blocks on private nodes :V).

I haven't spent enough time looking at his proposal, but it seems like we
can kill two birds with one stone on this.

[1] https://lore.kernel.org/linux-mm/agYJcRgOHho8upVv@gourry-fedora-PF4VCD3F/

~Gregory

^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-10 20:12 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <d01fb1ed-2418-42ee-aea2-37f9a5c5729c@kernel.org>

On Wed, Jun 10, 2026 at 08:59:59PM +0200, David Hildenbrand (Arm) wrote:
> On 6/10/26 18:37, Gregory Price wrote:
> > On Wed, Jun 10, 2026 at 05:00:33PM +0200, David Hildenbrand (Arm) wrote:
> >> On 6/10/26 12:41, Gregory Price wrote:
> > 
> > So, I remember this being asked, and I didn't fully grok the request.
> > 
> > I'm still not sure I fully understand the question, so apologies if I'm
> > answering the wrong things here.
> > 
> > I understand this question in two ways:
> > 
> >   1) Can we disallow PAGE allocation and limit this to FOLIO allocation
> 
> Yes. Can we only allow folios to be allocated from private memory nodes. So let
> me reply to that one below.
> 
... snip ...
> 
> At LSF/MM we talked about how GFP flags are bad and how deriving stuff from the
> context might be better. I think there was also talk about how the memalloc_*
> interface might be a better way forward. Maybe we would start giving the
> allocator more context ("we are allocating a folio").
> 
> The following is incomplete (esp. hugetlb stuff I assume), just as some idea:
>

Ok, the mental gap I have is not knowing the full context behind
memalloc.  I'll take this and do some reading / prototyping, but
this looks entirely reasonable.

I will still probably send the next RFC version tomorrow or Friday,
as I want to get some eyes on the __GFP_PRIVATE-less pattern.

Also, I made a new `anondax` driver which enables userland testing
of this functionality without any specialty hardware.

tl;dr:

fd = open("/dev/anondax0.0", ....);
buf = mmap(fd, ...);
buf[0] = 0xDEADBEEF; /* fault to anondax driver */

static vm_fault_t anon_dax_fault(struct vm_fault *vmf)
{
        struct dev_dax *dev_dax = vmf->vma->vm_file->private_data;
        vm_fault_t ret;
        int id;

        id = dax_read_lock();
        if (!dax_alive(dev_dax->dax_dev))
                ret = VM_FAULT_SIGBUS;
        else
                ret = do_anonymous_page_node(vmf, dev_dax->target_node);
        dax_read_unlock(id);

        if (ret & VM_FAULT_OOM)
                return VM_FAULT_SIGBUS;
        return ret ? ret : VM_FAULT_NOPAGE;
}

With:
  qemu-system-x86_64 -m 5G \
    -object memory-backend-ram,id=m0,size=4G -numa node,nodeid=0,memdev=m0 \
    -object memory-backend-ram,id=m1,size=1G -numa node,nodeid=1,memdev=m1 \
    -append "... memmap=0x40000000!0x140000000"

Voila - buddy-managed private anonymous memory (1G region)

No need to reinvent page_alloc.c or fault handling :]

This can be used to hammer on reclaim/compaction/whatever support
without needing any particular hardware setup, and in fact it gives
some memory devices a path to support in userland while standards
get worked out.

do_anonymous_page_node is a bit of a bodge right now but I just haven't
fleshed it out yet.  The idea is - don't reinvent the fault path, just
provide the appropriate context to memory.c to do the right thing.

If this is acceptable, I imagine whatever interface gets implemented
will carry an in-tree driver export only, similar to hotplug/kmem.

> From 64aaff5f40497201ecc089c3339df6576184c433 Mon Sep 17 00:00:00 2001
> From: "David Hildenbrand (Arm)" <david@kernel.org>
> Date: Wed, 10 Jun 2026 20:55:49 +0200
> Subject: [PATCH] tmp
> 
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
> ---
>  include/linux/sched.h    |  2 +-
>  include/linux/sched/mm.h | 11 +++++++++++
>  mm/mempolicy.c           | 14 ++++++++++++--
>  mm/page_alloc.c          |  7 ++++++-
>  4 files changed, 30 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ee06cba5c6f5..9c850b7be6bf 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1778,7 +1778,7 @@ extern struct pid *cad_pid;
>  						 * I am cleaning dirty pages from some other bdi. */
>  #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
>  #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
> -#define PF__HOLE__00800000	0x00800000
> +#define PF_MEMALLOC_FOLIO	0x00800000	/* Allocating a folio that can end up on private memory nodes */
>  #define PF__HOLE__01000000	0x01000000
>  #define PF__HOLE__02000000	0x02000000
>  #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 95d0040df584..2101a447c084 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -471,6 +471,17 @@ static inline void memalloc_pin_restore(unsigned int flags)
>  	memalloc_flags_restore(flags);
>  }
> 
> +static inline unsigned int memalloc_folio_save(void)
> +{
> +	return memalloc_flags_save(PF_MEMALLOC_FOLIO);
> +}
> +
> +static inline void memalloc_folio_restore(unsigned int flags)
> +{
> +	memalloc_flags_restore(flags);
> +}
> +
> +
>  #ifdef CONFIG_MEMCG
>  DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
>  /**
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 36699fabd3c2..a78b0e5a1fce 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2506,8 +2506,13 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
>  struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
>  		struct mempolicy *pol, pgoff_t ilx, int nid)
>  {
> -	struct page *page = alloc_pages_mpol(gfp | __GFP_COMP, order, pol,
> +	struct page *page;
> +	int flags;
> +
> +	flags = memalloc_folio_save();
> +	page = alloc_pages_mpol(gfp | __GFP_COMP, order, pol,
>  			ilx, nid);
> +	memalloc_folio_restore(flags);
>  	if (!page)
>  		return NULL;
> 
> @@ -2588,7 +2593,12 @@ EXPORT_SYMBOL(alloc_pages_noprof);
> 
>  struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order)
>  {
> -	return page_rmappable_folio(alloc_pages_noprof(gfp | __GFP_COMP, order));
> +	struct folio *folio;
> +	int flags;
> +
> +	flags = memalloc_folio_save();
> +	folio = page_rmappable_folio(alloc_pages_noprof(gfp | __GFP_COMP, order));
> +	memalloc_folio_restore(flags);
> +	return folio;
>  }
>  EXPORT_SYMBOL(folio_alloc_noprof);
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ee902a468c2f..37434b37f7af 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5345,8 +5345,13 @@ EXPORT_SYMBOL(__alloc_pages_noprof);
>  struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
>  		nodemask_t *nodemask)
>  {
> -	struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
> +	struct page *page;
> +	int flags;
> +
> +	flags = memalloc_folio_save();
> +	page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
>  					preferred_nid, nodemask);
> +	memalloc_folio_restore(flags);
>  	return page_rmappable_folio(page);
>  }
>  EXPORT_SYMBOL(__folio_alloc_noprof);
> -- 
> 2.43.0
> 
> 
> -- 
> Cheers,
> 
> David

^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: David Hildenbrand (Arm) @ 2026-06-10 18:59 UTC (permalink / raw)
  To: Gregory Price
  Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <aimSzvoJDrpeQsmM@gourry-fedora-PF4VCD3F>

On 6/10/26 18:37, Gregory Price wrote:
> On Wed, Jun 10, 2026 at 05:00:33PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/10/26 12:41, Gregory Price wrote:
>>>
>>> Notably: slub.c injects __GFP_THISNODE internally on behalf of kmalloc,
>>> which causes spillage into private nodes because slub allows private
>>> nodes in its mask.  I think this is fixable.
>>>
>>> I have to inspect some other __GFP_THISNODE users (hugetlb, some arch
>>> code, etc), but it seems like fully dropping the FALLBACK entries and
>>> requiring __GFP_THISNODE might be sufficient.
>>
>> Sorry, I haven't been able to follow up so far, and not sure if that's what you
>> are discussing here ...
>>
>> After the LSF/MM session, I was wondering: if we focus on allowing only
>> folio allocations to end up on private memory nodes for now, could the
>> __GFP_THISNODE approach work there?
>>
>> Essentially, disallow any allocations on non-folio paths, and allow folio
>> allocation only with __GFP_THISNODE set.
>>
>> I have to find time to read the other mails in this thread, on my todo list.
>>
>> So sorry if that is precisely what is being discussed here.
>>
> 
> So, I remember this being asked, and I didn't fully grok the request.
> 
> I'm still not sure I fully understand the question, so apologies if I'm
> answering the wrong things here.
> 
> I understand this question in two ways:
> 
>   1) Can we disallow PAGE allocation and limit this to FOLIO allocation

Yes. Can we only allow folios to be allocated from private memory nodes? So let
me reply to that one below.

>   2) Can we disallow [Feature] (i.e. slab) allocation targeting the node.
> 
> 
> 1) Can we disallow page allocation and limit this to folios?
> 
> No, I don't think so.
> 
> Folio allocations are written in terms of page allocations, we would
> have to rewrite folio allocation interfaces and introduce a bunch of
> boilerplate for the sake of this.
> 
> struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
>                 int preferred_nid, nodemask_t *nodemask)
> {
>         struct page *page;
> 
>         page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask);
>         if (page)
>                 set_page_refcounted(page);
>         return page;
> }
> 
> struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
>                 nodemask_t *nodemask)
> {
>         struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
>                                         preferred_nid, nodemask);
> 	return page_rmappable_folio(page);
> }

At LSF/MM we talked about how GFP flags are bad and how deriving stuff from the
context might be better. I think there was also talk about how the memalloc_*
interface might be a better way forward. Maybe we would start giving the
allocator more context ("we are allocating a folio").

The following is incomplete (esp. hugetlb stuff I assume), just as some idea:

From 64aaff5f40497201ecc089c3339df6576184c433 Mon Sep 17 00:00:00 2001
From: "David Hildenbrand (Arm)" <david@kernel.org>
Date: Wed, 10 Jun 2026 20:55:49 +0200
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
---
 include/linux/sched.h    |  2 +-
 include/linux/sched/mm.h | 11 +++++++++++
 mm/mempolicy.c           | 14 ++++++++++++--
 mm/page_alloc.c          |  7 ++++++-
 4 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ee06cba5c6f5..9c850b7be6bf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1778,7 +1778,7 @@ extern struct pid *cad_pid;
 						 * I am cleaning dirty pages from some other bdi. */
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
-#define PF__HOLE__00800000	0x00800000
+#define PF_MEMALLOC_FOLIO	0x00800000	/* Allocating a folio that can end up on private memory nodes */
 #define PF__HOLE__01000000	0x01000000
 #define PF__HOLE__02000000	0x02000000
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 95d0040df584..2101a447c084 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -471,6 +471,17 @@ static inline void memalloc_pin_restore(unsigned int flags)
 	memalloc_flags_restore(flags);
 }

+static inline unsigned int memalloc_folio_save(void)
+{
+	return memalloc_flags_save(PF_MEMALLOC_FOLIO);
+}
+
+static inline void memalloc_folio_restore(unsigned int flags)
+{
+	memalloc_flags_restore(flags);
+}
+
+
 #ifdef CONFIG_MEMCG
 DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
 /**
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36699fabd3c2..a78b0e5a1fce 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2506,8 +2506,13 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
 struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
 		struct mempolicy *pol, pgoff_t ilx, int nid)
 {
-	struct page *page = alloc_pages_mpol(gfp | __GFP_COMP, order, pol,
+	struct page *page;
+	int flags;
+
+	flags = memalloc_folio_save();
+	page = alloc_pages_mpol(gfp | __GFP_COMP, order, pol,
 			ilx, nid);
+	memalloc_folio_restore(flags);
 	if (!page)
 		return NULL;

@@ -2588,7 +2593,12 @@ EXPORT_SYMBOL(alloc_pages_noprof);

 struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order)
 {
-	return page_rmappable_folio(alloc_pages_noprof(gfp | __GFP_COMP, order));
+	struct folio *folio;
+	int flags;
+
+	flags = memalloc_folio_save();
+	folio = page_rmappable_folio(alloc_pages_noprof(gfp | __GFP_COMP, order));
+	memalloc_folio_restore(flags);
+	return folio;
 }
 EXPORT_SYMBOL(folio_alloc_noprof);

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee902a468c2f..37434b37f7af 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5345,8 +5345,13 @@ EXPORT_SYMBOL(__alloc_pages_noprof);
 struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
 		nodemask_t *nodemask)
 {
-	struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
+	struct page *page;
+	int flags;
+
+	flags = memalloc_folio_save();
+	page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
 					preferred_nid, nodemask);
+	memalloc_folio_restore(flags);
 	return page_rmappable_folio(page);
 }
 EXPORT_SYMBOL(__folio_alloc_noprof);
-- 
2.43.0


-- 
Cheers,

David

^ permalink raw reply related

* [PATCH 1/1] cgroup: rdma: free idle pools during cgroup teardown
From: Ren Wei @ 2026-06-10 18:13 UTC (permalink / raw)
  To: cgroups
  Cc: tj, hannes, mkoutny, pandit.parav, yuantan098, zcliangcn, bird,
	tr0jan, d4n.for.sec, n05ec
In-Reply-To: <cover.1781092143.git.d4n.for.sec@gmail.com>

From: Daming Li <d4n.for.sec@gmail.com>

rdmacg_css_offline() converts each pool to all-max limits so the
existing reclaim path can free it after the last uncharge. However,
zero-usage pools are already reclaimable at that point and leaving them
linked until rdmacg_css_free() lets later device teardown hit a
use-after-free when free_cg_rpool_locked() deletes cg_node from a freed
cgroup list head.

Free zero-usage pools directly from rdmacg_css_offline() while holding
rdmacg_mutex. This keeps the existing reclaim rule, avoids new lifetime
states, and ensures a cgroup cannot be freed with reclaimable rdmacg
pools still attached.

Fixes: 39d3e7584a68 ("rdmacg: Added rdma cgroup controller")
Cc: stable@vger.kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Assisted-by: Codex:GPT-5.4
Co-developed-by: Luxing Yin <tr0jan@lzu.edu.cn>
Signed-off-by: Luxing Yin <tr0jan@lzu.edu.cn>
Signed-off-by: Daming Li <d4n.for.sec@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
---
 kernel/cgroup/rdma.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/cgroup/rdma.c b/kernel/cgroup/rdma.c
index 9967fb25c563..10ae628d91a7 100644
--- a/kernel/cgroup/rdma.c
+++ b/kernel/cgroup/rdma.c
@@ -587,18 +587,22 @@ static void rdmacg_css_free(struct cgroup_subsys_state *css)
  *
  * This function is called when @css is about to go away and responsible
  * for shooting down all rdmacg associated with @css. As part of that it
- * marks all the resource pool entries to max value, so that when resources are
- * uncharged, associated resource pool can be freed as well.
+ * marks all the resource pool entries to max value, so that active pools can
+ * be freed when resources are uncharged and idle pools can be freed
+ * immediately.
  */
 static void rdmacg_css_offline(struct cgroup_subsys_state *css)
 {
 	struct rdma_cgroup *cg = css_rdmacg(css);
-	struct rdmacg_resource_pool *rpool;
+	struct rdmacg_resource_pool *rpool, *tmp;
 
 	mutex_lock(&rdmacg_mutex);
 
-	list_for_each_entry(rpool, &cg->rpools, cg_node)
+	list_for_each_entry_safe(rpool, tmp, &cg->rpools, cg_node) {
 		set_all_resource_max_limit(rpool);
+		if (rpool->usage_sum == 0)
+			free_cg_rpool_locked(rpool);
+	}
 
 	mutex_unlock(&rdmacg_mutex);
 }
-- 
2.34.1


^ permalink raw reply related

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-10 16:37 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <c1b66e7a-bb95-4295-8193-55ceadaaa578@kernel.org>

On Wed, Jun 10, 2026 at 05:00:33PM +0200, David Hildenbrand (Arm) wrote:
> On 6/10/26 12:41, Gregory Price wrote:
> > On Wed, Jun 03, 2026 at 03:00:01PM +1000, Balbir Singh wrote:
> > 
> > Notably: slub.c injects __GFP_THISNODE internally on behalf of kmalloc,
> > which causes spillage into private nodes because slub allows private
> > nodes in its mask.  I think this is fixable.
> > 
> > I have to inspect some other __GFP_THISNODE users (hugetlb, some arch
> > code, etc), but it seems like fully dropping the FALLBACK entries and
> > requiring __GFP_THISNODE might be sufficient.
> 
> Sorry, I haven't been able to follow up so far, and not sure if that's what you
> are discussing here ...
> 
> After the LSF/MM session, I was wondering: if we focus on allowing only
> folio allocations to end up on private memory nodes for now, could the
> __GFP_THISNODE approach work there?
> 
> Essentially, disallow any allocations on non-folio paths, and allow folio
> allocation only with __GFP_THISNODE set.
> 
> I have to find time to read the other mails in this thread, on my todo list.
> 
> So sorry if that is precisely what is being discussed here.
> 

So, I remember this being asked, and I didn't fully grok the request.

I'm still not sure I fully understand the question, so apologies if I'm
answering the wrong things here.

I understand this question in two ways:

  1) Can we disallow PAGE allocation and limit this to FOLIO allocation
  2) Can we disallow [Feature] (i.e. slab) allocation targeting the node.


1) Can we disallow page allocation and limit this to folios?

No, I don't think so.

Folio allocations are written in terms of page allocations, we would
have to rewrite folio allocation interfaces and introduce a bunch of
boilerplate for the sake of this.

struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
                int preferred_nid, nodemask_t *nodemask)
{
        struct page *page;

        page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask);
        if (page)
                set_page_refcounted(page);
        return page;
}

struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
                nodemask_t *nodemask)
{
        struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
                                        preferred_nid, nodemask);
	return page_rmappable_folio(page);
}

At the end of the day, this all reduces to `get_page_from_freelist`,
and at that level we don't really care about folio vs page.

__GFP_COMP is insufficient to differentiate between a non-folio compound
page and a folio, and __GFP_COMP is passed into __alloc_pages_*
interfaces all over the kernel.

Trying to detach these paths seems like a horrible rat's nest /
not feasible / will create a lot of boilerplate for little value.

(I did not fully understand this request when it was asked, I do
 not fully understand this request now, please let me know if I
 have misunderstood what you were asking).



2) Can we disallow SLAB allocation.

Yeah, but I think a better question is whether there's a difference
between alloc_pages_node() and kmalloc_node() when it all just sinks
to the same fundamental code in mm/page_alloc.c

Maybe there's an argument for something like NP_OPT_KMALLOC (allow slab
allocations on the private node w/ __GFP_THISNODE)

On my current set, I don't implement any explicit filtering at all in
mm/page_alloc.c - the filtering is a function of the nodes not being
present in the FALLBACK list and only having a NOFALLBACK list.

What __GFP_THISNODE actually does under the hood is just switch
which zone list (FALLBACK vs NOFALLBACK) is used for the target node.

For isolation w/o __GFP_PRIVATE, we're removing N_MEMORY_PRIVATE nodes
from *their own FALLBACK* list and only adding them to their NOFALLBACK
list.  That means to reach a private node you MUST use __GFP_THISNODE.

I realize this is confusing, but essentially we don't have to modify
mm/page_alloc.c to get the __GFP_THISNODE filtering, we get this from
the fallback/nofallback list construction.
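
Since this is confusing in prose, here is a rough userspace model of the list
construction.  It is a sketch under assumptions, not mm/page_alloc.c: struct
model_node, MODEL_GFP_THISNODE and the model_* functions are invented for
illustration, zones are collapsed to one counter per node, and I am assuming a
private node's FALLBACK list still carries the other (non-private) nodes.

```c
/* Userspace model only; every name here is hypothetical. */
#include <stdbool.h>

#define MODEL_NR_NODES 3
#define MODEL_GFP_THISNODE 0x1u
#define MODEL_NO_NODE (-1)

struct model_node {
	bool is_private;
	int free_pages;
	int fallback[MODEL_NR_NODES];	/* FALLBACK list, in search order */
	int fallback_len;
};

/* Build each node's FALLBACK list.  Private nodes are never added to any
 * FALLBACK list, not even their own.  The NOFALLBACK list is implicit:
 * it is always just the node itself. */
static void model_build_zonelists(struct model_node *nodes, int nr)
{
	for (int n = 0; n < nr; n++) {
		int k = 0;

		if (!nodes[n].is_private)
			nodes[n].fallback[k++] = n;
		for (int m = 0; m < nr; m++)
			if (m != n && !nodes[m].is_private)
				nodes[n].fallback[k++] = m;
		nodes[n].fallback_len = k;
	}
}

static bool model_take_page(struct model_node *node)
{
	if (node->free_pages <= 0)
		return false;
	node->free_pages--;
	return true;
}

/* Returns the node the page came from, or MODEL_NO_NODE on failure.
 * The THISNODE bit only selects which list is walked. */
static int model_alloc_page(struct model_node *nodes, int nid, unsigned int gfp)
{
	if (gfp & MODEL_GFP_THISNODE)	/* NOFALLBACK: the node itself only */
		return model_take_page(&nodes[nid]) ? nid : MODEL_NO_NODE;

	for (int i = 0; i < nodes[nid].fallback_len; i++) {
		int cand = nodes[nid].fallback[i];

		if (model_take_page(&nodes[cand]))
			return cand;
	}
	return MODEL_NO_NODE;
}
```

Note there is no "is this node private?" test anywhere in model_alloc_page();
the isolation falls out of how the lists were built, which is the point above.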


Ok, so how does this play out in practice - and why do I call this
filtering mechanism fragile?

Consider kmalloc_node() and ___slab_alloc():

kmalloc_node(...)
  └─ ___slab_alloc()     mm/slub.c:4406   pc.flags |= __GFP_THISNODE
      └─ new_slab(s, pc.flags, node)
          └─ allocate_slab(s, flags, node)
              └─ alloc_slab_page(flags, node, oo, …)
                  └─ __alloc_frozen_pages(flags, order, node, NULL);

Slab silently upgrades the page allocator flags here to include
__GFP_THISNODE - even if the user didn't request that behavior.

This is exactly the kind of "spillage" I said was hard to police at LSF.

Without __GFP_PRIVATE, we have to keep an eye on what around the kernel
is using __GFP_THISNODE and how.

For mm/slub.c we can choose to do one of two things:

  1) 100% refuse slab allocations on private nodes, i.e.:

     kmalloc_node(..., private_nid, __GFP_THISNODE)

     will fail (return NULL).

  or

  2) Do not upgrade private-node slab requests w/ __GFP_THISNODE

     This allows kmalloc_node() to work the same as folio_alloc()
     or alloc_pages() interfaces (__GFP_THISNODE is the key), with
     the understanding that any __GFP_THISNODE user may get access.

We can opt these nodes into slab/kmalloc with a NP_OPT_SLAB
if the owner wants kmalloc_node(), with the understanding that any
caller using __GFP_THISNODE may get access.

That's the kind of fragility I was trying to avoid.
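
A small userspace sketch of the two options next to the current behaviour, in
case it helps.  This is a model under assumptions, not mm/slub.c: the policy
enum, MODEL_GFP_THISNODE and the model_* functions are invented names, and the
"page allocator" here only encodes the rule that a private node is reachable
with THISNODE and nothing else.

```c
/* Userspace model only; every name here is hypothetical. */
#include <stdbool.h>

#define MODEL_GFP_THISNODE 0x1u
#define MODEL_NR_NODES 2

enum model_slab_policy {
	MODEL_UPGRADE_ALWAYS,		/* as described above: always OR in THISNODE */
	MODEL_REFUSE_PRIVATE,		/* option 1: never serve private nodes */
	MODEL_NO_UPGRADE_PRIVATE,	/* option 2: caller's flags decide */
};

static const bool model_node_private[MODEL_NR_NODES] = { false, true };

/* Page allocator stand-in.  Returns true if memory from @nid is handed out. */
static bool model_page_alloc(int nid, unsigned int gfp)
{
	if (model_node_private[nid])
		return (gfp & MODEL_GFP_THISNODE) != 0;
	return true;
}

/* kmalloc_node() stand-in: the policy decides what happens to the flags
 * between the caller and the page allocator. */
static bool model_kmalloc_node(int nid, unsigned int gfp,
			       enum model_slab_policy policy)
{
	bool priv = model_node_private[nid];

	switch (policy) {
	case MODEL_UPGRADE_ALWAYS:
		gfp |= MODEL_GFP_THISNODE;	/* silent upgrade */
		break;
	case MODEL_REFUSE_PRIVATE:
		if (priv)
			return false;		/* fail even with THISNODE */
		gfp |= MODEL_GFP_THISNODE;
		break;
	case MODEL_NO_UPGRADE_PRIVATE:
		if (!priv)
			gfp |= MODEL_GFP_THISNODE;
		break;
	}
	return model_page_alloc(nid, gfp);
}
```

With MODEL_UPGRADE_ALWAYS a caller that never passed THISNODE still lands on
the private node, which is the spillage.  Option 2 closes that hole but, in
this model as in the real thing, any caller that does pass THISNODE gets in.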


That said, in practice, I have found that basic kernel operations don't
generally use kmalloc_node() w/ __GFP_THISNODE - there's just
nothing to prevent anyone from doing so.

So this seems promising...
And then there's arch/powerpc/platforms/powernv/memtrace.c

static u64 memtrace_alloc_node(u32 nid, u64 size)
{
	... snip ...
        page = alloc_contig_pages(nr_pages, GFP_KERNEL | __GFP_THISNODE |
                                  __GFP_NOWARN | __GFP_ZERO, nid, NULL);
	... snip ...
}

static int memtrace_init_regions_runtime(u64 size)
{
	... snip ...
        for_each_online_node(nid) {
                m = memtrace_alloc_node(nid, size);
	... snip ...
}

static int memtrace_enable_set(void *data, u64 val)
{
	... snip ...
        if (memtrace_init_regions_runtime(val))
                goto out_unlock;
	... snip ...
}

This is the *exact* pattern I said would be hard to police - and it
doesn't look like a bug, just not informed that private nodes exist.

This is why I'm concerned with trying to depend on __GFP_THISNODE as the
filtering function.

That said, the number of __GFP_THISNODE users is very limited
kernel-wide, so maybe that's an acceptable maintenance burden?

~Gregory

^ permalink raw reply

* Re: [PATCH v3 3/7] sched/fair: Add cgroup_mode: max
From: Waiman Long @ 2026-06-10 15:42 UTC (permalink / raw)
  To: Peter Zijlstra, mingo
  Cc: chenridong, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef
In-Reply-To: <d4ca5fe7-fd76-47c8-949a-a69916bfcbd4@redhat.com>

On 6/10/26 11:09 AM, Waiman Long wrote:
> On 6/5/26 8:40 AM, Peter Zijlstra wrote:
>> In order to avoid the average CPU fraction avg(F_g_n) becoming tiny '1/N',
>> assume each cgroup is maximally concurrent and distribute 'N*weight', such
>> that:
>>
>>     F_g_n' = N * F_g_n
>>
>> Giving:
>>
>>     avg(F_g_n') = N*avg(F_g_n) ~ N * 1/N = 1
>>
>> And while this sounds like it solves things, remember what that ~ meant. There
>> is the corner case when a cgroup is minimally loaded, eg a single runnable
>> task, therefore limit the CPU fraction to that of a nice -20 task to avoid
>> getting too much load.
>>
>> This last bit is what makes it different from a previous proposal to allow
>> raising cpu.weight to '100 * N', that would not limit the minimal concurrency
>> case and results in a very large F_g_n. And just like F_g_n << 1 is
>> problematic, so is F_g_n >> 1 for the exact same reasons (it would drown the
>> kthreads, but it also risks overflowing the load values).
>>
>> So while this might appear to be a better scheme than the current default
>> scheme, it doesn't really handle less than maximal concurrency nicely -- it
>> clips and introduces artificially large weights. So where the traditional SMP
>> mode works well when nr_tasks << nr_cpus, MAX doesn't work well in that regime
>> and vice-versa.
>>
>> The meaning of "cpu.weight" would be: weight per allowed CPU.
>>
>> Included for completeness (and infrastructure).
>>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> ---
>>   include/linux/cpuset.h |    6 +++++
>>   kernel/cgroup/cpuset.c |   15 ++++++++++++++
>>   kernel/sched/debug.c   |    1
>>   kernel/sched/fair.c    |   52 ++++++++++++++++++++++++++++++++++++++++++++-----
>>   4 files changed, 69 insertions(+), 5 deletions(-)
>>
>> --- a/include/linux/cpuset.h
>> +++ b/include/linux/cpuset.h
>> @@ -80,6 +80,7 @@ extern void lockdep_assert_cpuset_lock_h
>>   extern void cpuset_cpus_allowed_locked(struct task_struct *p, struct cpumask *mask);
>>   extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
>>   extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
>> +extern int cpuset_num_cpus(struct cgroup *cgroup);
>>   extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
>>   #define cpuset_current_mems_allowed (current->mems_allowed)
>>   void cpuset_init_current_mems_allowed(void);
>> @@ -216,6 +217,11 @@ static inline bool cpuset_cpus_allowed_f
>>       return false;
>>   }
>>   +static inline int cpuset_num_cpus(struct cgroup *cgroup)
>> +{
>> +    return num_online_cpus();
>> +}
>> +
>>   static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
>>   {
>>       return node_possible_map;
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -4116,6 +4116,21 @@ bool cpuset_cpus_allowed_fallback(struct
>>       return changed;
>>   }
>>   +int cpuset_num_cpus(struct cgroup *cgrp)
>> +{
>> +    int nr = num_online_cpus();
>> +    struct cpuset *cs;
>> +
>> +    if (is_in_v2_mode()) {
>> +        guard(rcu)();
>> +        cs = css_cs(cgroup_e_css(cgrp, &cpuset_cgrp_subsys));
>> +        if (cs)
>> +            nr = cpumask_weight(cs->effective_cpus);
>> +    }
>> +
>> +    return nr;
>> +}
>
> I just have a question about cgroup v1 support. I am assuming that 
> cgroup v1 without the cpuset_v2_mode mount option is not supported. To 
> fully support cgroup v1, you may have to use guarantee_active_cpus() 
> to return the actual set of CPUs that the task can run on. Also there 
> is a caveat about the arm64 specific task_cpu_possible_mask() for 
> certain arm64 CPUs. That is for 32-bit binary running on 64-bit core 
> which are allowed only on a selected subset of cores within the CPU.
>
> This is probably not what you want to focus on right now, but it will 
> be good to have a comment to list items that are not fully supported 
> here. 

FYI, you may have to take the callback_lock to ensure the stability of 
the effective_cpus mask.

Cheers,
Longman


^ permalink raw reply

* [PATCH v2 16/16] mm/slab: replace __GFP_NO_OBJ_EXT with SLAB_ALLOC_NO_RECURSE for sheaves
From: Vlastimil Babka (SUSE) @ 2026-06-10 15:40 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260610-slab_alloc_flags-v2-0-7190909db118@kernel.org>

Finish the switch away from __GFP_NO_OBJ_EXT by replacing it with
SLAB_ALLOC_NO_RECURSE when allocating empty sheaves. Pass alloc_flags to
[__]alloc_empty_sheaf(). Callers that can't be part of a recursive
kmalloc() chain simply pass SLAB_ALLOC_DEFAULT. Use kmalloc_flags()
instead of kzalloc() for allocating the sheaf.

This leaves __GFP_NO_OBJ_EXT with no users in slab, so stop allowing the
flag in kmalloc_nolock().

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 include/linux/slab.h |  6 +++---
 mm/slub.c            | 31 ++++++++++++++++---------------
 2 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index b955f3cbb732..43c3d9b51107 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -1039,9 +1039,9 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 /**
  * kmalloc_nolock - Allocate an object of given size from any context.
  * @size: size to allocate
- * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO, __GFP_NO_OBJ_EXT
- * allowed. Also __GFP_NOWARN and __GFP_NOMEMALLOC are allowed but added
- * internally thus not necessary.
+ * @gfp_flags: GFP flags. Only __GFP_ACCOUNT and __GFP_ZERO allowed.  Also
+ * __GFP_NOWARN and __GFP_NOMEMALLOC are allowed but added internally thus not
+ * necessary.
  * @node: node number of the target node.
  *
  * Return: pointer to the new object or NULL in case of error.
diff --git a/mm/slub.c b/mm/slub.c
index 7dfbd0251aa2..5d7ea72ebebd 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2756,7 +2756,7 @@ static inline void *setup_object(struct kmem_cache *s, void *object)
 }
 
 static struct slab_sheaf *__alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp,
-					      unsigned int capacity)
+				unsigned int alloc_flags, unsigned int capacity)
 {
 	struct slab_sheaf *sheaf;
 	size_t sheaf_size;
@@ -2767,10 +2767,10 @@ static struct slab_sheaf *__alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp,
 	 * bucket)
 	 */
 	if (s->flags & SLAB_KMALLOC)
-		gfp |= __GFP_NO_OBJ_EXT;
+		alloc_flags |= SLAB_ALLOC_NO_RECURSE;
 
 	sheaf_size = struct_size(sheaf, objects, capacity);
-	sheaf = kzalloc(sheaf_size, gfp);
+	sheaf = kmalloc_flags(sheaf_size, gfp | __GFP_ZERO, alloc_flags, NUMA_NO_NODE);
 
 	if (unlikely(!sheaf))
 		return NULL;
@@ -2783,20 +2783,20 @@ static struct slab_sheaf *__alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp,
 }
 
 static inline struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s,
-						   gfp_t gfp)
+				gfp_t gfp, unsigned int alloc_flags)
 {
-	if (gfp & __GFP_NO_OBJ_EXT)
+	if (alloc_flags & SLAB_ALLOC_NO_RECURSE)
 		return NULL;
 
 	gfp &= ~OBJCGS_CLEAR_MASK;
 
-	return __alloc_empty_sheaf(s, gfp, s->sheaf_capacity);
+	return __alloc_empty_sheaf(s, gfp, alloc_flags, s->sheaf_capacity);
 }
 
 static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
 {
 	/*
-	 * If the sheaf was created with __GFP_NO_OBJ_EXT flag then its
+	 * If the sheaf was created with SLAB_ALLOC_NO_RECURSE flag then its
 	 * corresponding extension is NULL and alloc_tag_sub() will throw a
 	 * warning, therefore replace NULL with CODETAG_EMPTY to indicate
 	 * that the extension for this sheaf is expected to be NULL.
@@ -4689,7 +4689,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 		return NULL;
 
 	if (!empty) {
-		empty = alloc_empty_sheaf(s, gfp);
+		empty = alloc_empty_sheaf(s, gfp, alloc_flags);
 		if (!empty)
 			return NULL;
 	}
@@ -5063,7 +5063,7 @@ kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
 
 	if (unlikely(size > s->sheaf_capacity)) {
 
-		sheaf = __alloc_empty_sheaf(s, gfp, size);
+		sheaf = __alloc_empty_sheaf(s, gfp, SLAB_ALLOC_DEFAULT, size);
 		if (!sheaf)
 			return NULL;
 
@@ -5108,7 +5108,7 @@ kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
 
 
 	if (!sheaf)
-		sheaf = alloc_empty_sheaf(s, gfp);
+		sheaf = alloc_empty_sheaf(s, gfp, SLAB_ALLOC_DEFAULT);
 
 	if (sheaf) {
 		sheaf->capacity = s->sheaf_capacity;
@@ -5392,7 +5392,7 @@ static void *__kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_f
 
 	VM_WARN_ON_ONCE(alloc_flags_allow_spinning(ac->alloc_flags));
 	VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO |
-			__GFP_NO_OBJ_EXT | __GFP_NOWARN | __GFP_NOMEMALLOC));
+				      __GFP_NOWARN | __GFP_NOMEMALLOC));
 
 	gfp_flags |= __GFP_NOWARN | __GFP_NOMEMALLOC;
 
@@ -5907,7 +5907,7 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 	if (!allow_spin)
 		return NULL;
 
-	empty = alloc_empty_sheaf(s, GFP_NOWAIT);
+	empty = alloc_empty_sheaf(s, GFP_NOWAIT, SLAB_ALLOC_DEFAULT);
 	if (empty)
 		goto got_empty;
 
@@ -6091,7 +6091,7 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 
 		local_unlock(&s->cpu_sheaves->lock);
 
-		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
+		empty = alloc_empty_sheaf(s, GFP_NOWAIT, SLAB_ALLOC_DEFAULT);
 
 		if (!empty)
 			goto fail;
@@ -7636,7 +7636,7 @@ static int init_percpu_sheaves(struct kmem_cache *s)
 		if (!s->sheaf_capacity)
 			pcs->main = &bootstrap_sheaf;
 		else
-			pcs->main = alloc_empty_sheaf(s, GFP_KERNEL);
+			pcs->main = alloc_empty_sheaf(s, GFP_KERNEL, SLAB_ALLOC_DEFAULT);
 
 		if (!pcs->main)
 			return -ENOMEM;
@@ -8502,7 +8502,8 @@ static void __init bootstrap_cache_sheaves(struct kmem_cache *s)
 
 		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 
-		pcs->main = __alloc_empty_sheaf(s, GFP_KERNEL, capacity);
+		pcs->main = __alloc_empty_sheaf(s, GFP_KERNEL,
+				SLAB_ALLOC_DEFAULT, capacity);
 
 		if (!pcs->main) {
 			failed = true;

-- 
2.54.0
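
As a rough userspace illustration of the recursion guard described in this patch, the sketch below models an allocator that wants an empty sheaf for each allocation, where the sheaf itself comes from the same allocator. The flag values are taken from mm/slab.h in this series. All model_* names, the counters and the "every cache is a kmalloc cache" simplification are assumptions made for the example and are not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Flag values as defined in mm/slab.h by this series. */
#define SLAB_ALLOC_DEFAULT	0x00
#define SLAB_ALLOC_NO_RECURSE	0x04

/* Model-only bookkeeping, not present in the kernel. */
static int sheaf_allocs;
static int depth;
static int max_depth;

void *model_kmalloc_flags(size_t size, unsigned int alloc_flags);

/* Model of alloc_empty_sheaf(): refuse when called from a nested allocation. */
void *model_alloc_empty_sheaf(unsigned int alloc_flags)
{
	if (alloc_flags & SLAB_ALLOC_NO_RECURSE)
		return NULL;

	/*
	 * The sheaf itself is allocated from a kmalloc cache, so mark the
	 * nested call. (The patch only does this for SLAB_KMALLOC caches;
	 * the model assumes every cache is one.)
	 */
	alloc_flags |= SLAB_ALLOC_NO_RECURSE;
	sheaf_allocs++;
	return model_kmalloc_flags(64, alloc_flags);
}

/* Model of a kmalloc() that tries to get an empty sheaf on every call. */
void *model_kmalloc_flags(size_t size, unsigned int alloc_flags)
{
	void *sheaf;

	depth++;
	if (depth > max_depth)
		max_depth = depth;
	/* Model-only sanity check: the nesting must stay bounded. */
	assert(depth < 8);

	sheaf = model_alloc_empty_sheaf(alloc_flags);
	free(sheaf);	/* the model does not keep sheaves around */

	depth--;
	return malloc(size);
}
```

With SLAB_ALLOC_DEFAULT the model allocates exactly one sheaf and nests two levels deep. The nested call sees SLAB_ALLOC_NO_RECURSE and gets NULL instead of triggering another sheaf allocation.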



* [PATCH v2 15/16] mm/slab: remove __GFP_NO_OBJ_EXT usage from alloc_slab_obj_exts()
From: Vlastimil Babka (SUSE) @ 2026-06-10 15:40 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260610-slab_alloc_flags-v2-0-7190909db118@kernel.org>

__GFP_NO_OBJ_EXT has limited scope within the slab allocator itself and
gfp flags are a scarce resource, unlike slab's alloc_flags.

Introduce SLAB_ALLOC_NO_RECURSE alloc flag that has the same intent as
__GFP_NO_OBJ_EXT but a more generic name, meaning that a kmalloc()
family function should not recurse into another kmalloc*() for the
purposes of allocating auxiliary structures (obj_ext arrays or sheaves).

First, replace the use of __GFP_NO_OBJ_EXT when allocating obj_ext arrays
in alloc_slab_obj_exts(). Make use of the newly added kmalloc_flags()
function, where we can pass alloc_flags with SLAB_ALLOC_NO_RECURSE
added. This will also pass through SLAB_ALLOC_TRYLOCK so we don't need
to special case kmalloc_nolock() anymore.

Note that until now the kmalloc_nolock() call ignored the incoming gfp
flags and hardcoded __GFP_ZERO | __GFP_NO_OBJ_EXT. But it's correct to
pass on the incoming gfp flags (only augmented with __GFP_ZERO), because
if alloc_flags contain SLAB_ALLOC_TRYLOCK, the incoming gfp flags must
also be compatible with it.

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/slab.h |  1 +
 mm/slub.c | 13 +++++--------
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 45bfcfb35a9c..509f330654b8 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -21,6 +21,7 @@
 #define SLAB_ALLOC_DEFAULT	0x00 /* no flags */
 #define SLAB_ALLOC_TRYLOCK	0x01 /* a kmalloc_nolock() allocation */
 #define SLAB_ALLOC_NEW_SLAB	0x02 /* a flag for alloc_slab_obj_exts() */
+#define SLAB_ALLOC_NO_RECURSE	0x04 /* prevent kmalloc() recursion */
 
 static inline bool alloc_flags_allow_spinning(const unsigned int alloc_flags)
 {
diff --git a/mm/slub.c b/mm/slub.c
index cbb38bd01e46..7dfbd0251aa2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2167,15 +2167,12 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
 
 	gfp &= ~OBJCGS_CLEAR_MASK;
 	/* Prevent recursive extension vector allocation */
-	gfp |= __GFP_NO_OBJ_EXT;
+	alloc_flags |= SLAB_ALLOC_NO_RECURSE;
 
 	sz = obj_exts_alloc_size(s, slab, gfp);
 
-	if (unlikely(!allow_spin))
-		vec = kmalloc_nolock(sz, __GFP_ZERO | __GFP_NO_OBJ_EXT,
-				     slab_nid(slab));
-	else
-		vec = kmalloc_node(sz, gfp | __GFP_ZERO, slab_nid(slab));
+	/* This will use kmalloc_nolock() if alloc_flags say so */
+	vec = kmalloc_flags(sz, gfp | __GFP_ZERO, alloc_flags, slab_nid(slab));
 
 	if (!vec) {
 		/*
@@ -2251,7 +2248,7 @@ static inline void free_slab_obj_exts(struct slab *slab, bool allow_spin)
 	}
 
 	/*
-	 * obj_exts was created with __GFP_NO_OBJ_EXT flag, therefore its
+	 * obj_exts was created with SLAB_ALLOC_NO_RECURSE flag, therefore its
 	 * corresponding extension will be NULL. alloc_tag_sub() will throw a
 	 * warning if slab has extensions but the extension of an object is
 	 * NULL, therefore replace NULL with CODETAG_EMPTY to indicate that
@@ -2374,7 +2371,7 @@ __alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags,
 	if (s->flags & (SLAB_NO_OBJ_EXT | SLAB_NOLEAKTRACE))
 		return;
 
-	if (flags & __GFP_NO_OBJ_EXT)
+	if (alloc_flags & SLAB_ALLOC_NO_RECURSE)
 		return;
 
 	slab = virt_to_slab(object);

-- 
2.54.0
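
A small userspace sketch of the flag handling in this patch, in case it helps review: OR-ing in SLAB_ALLOC_NO_RECURSE keeps any SLAB_ALLOC_TRYLOCK bit intact, which is why the separate kmalloc_nolock() branch can go away. The flag values are copied from the mm/slab.h hunk above. The two model_* helpers are hypothetical names for this example only.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as defined in mm/slab.h by this series. */
#define SLAB_ALLOC_DEFAULT	0x00
#define SLAB_ALLOC_TRYLOCK	0x01
#define SLAB_ALLOC_NEW_SLAB	0x02
#define SLAB_ALLOC_NO_RECURSE	0x04

/* Model of the flag update at the top of alloc_slab_obj_exts(). */
unsigned int model_obj_exts_alloc_flags(unsigned int alloc_flags)
{
	/* Other bits, including SLAB_ALLOC_TRYLOCK, pass through unchanged. */
	return alloc_flags | SLAB_ALLOC_NO_RECURSE;
}

/* Model of the check in the alloc tagging hook: skip nested allocations. */
bool model_hook_skips(unsigned int alloc_flags)
{
	return (alloc_flags & SLAB_ALLOC_NO_RECURSE) != 0;
}
```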



* [PATCH v2 14/16] mm/slab: introduce kmalloc_flags()
From: Vlastimil Babka (SUSE) @ 2026-06-10 15:40 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260610-slab_alloc_flags-v2-0-7190909db118@kernel.org>

With alloc_flags usage in slab, we can replace __GFP_NO_OBJ_EXT with an
alloc flag that prevents kmalloc recursion. For that we need a version
of kmalloc() that takes alloc_flags and use it in places that perform
these potentially recursive kmalloc allocations (of sheaves or obj_ext
arrays).

Add this function, named kmalloc_flags(). Right now it's only useful for
these nested allocations, so it doesn't need to optimize build-time
constant sizes like kmalloc() or kmalloc_buckets.

Since we need it to support both the normal context and the non-spinning
kmalloc_nolock() context through the SLAB_ALLOC_TRYLOCK flag, split out
most of the special _kmalloc_nolock_noprof() implementation to
__kmalloc_nolock_noprof() that takes a slab_alloc_context, and make
_kmalloc_nolock_noprof() a simple tail-calling wrapper with the proper
context.

kmalloc_flags() can thus determine whether to call
__kmalloc_nolock_noprof() or __do_kmalloc_node(), based on the
given alloc_flags.

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/slab.h | 13 +++++++++++++
 mm/slub.c | 56 +++++++++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 56 insertions(+), 13 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 4db6d8aa0ee3..45bfcfb35a9c 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -11,6 +11,7 @@
 #include <linux/memcontrol.h>
 #include <linux/kfence.h>
 #include <linux/kasan.h>
+#include <linux/slab.h>
 
 /*
  * Internal slab definitions
@@ -26,6 +27,18 @@ static inline bool alloc_flags_allow_spinning(const unsigned int alloc_flags)
 	return !(alloc_flags & SLAB_ALLOC_TRYLOCK);
 }
 
+void *__kmalloc_flags_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t flags,
+				  unsigned int alloc_flags, int node)
+				  __assume_kmalloc_alignment __alloc_size(1);
+
+static __always_inline __alloc_size(1) void *_kmalloc_flags_noprof(size_t size,
+		gfp_t flags, unsigned int alloc_flags, int node, kmalloc_token_t token)
+{
+	return __kmalloc_flags_noprof(PASS_TOKEN_PARAMS(size, token), flags, alloc_flags, node);
+}
+#define kmalloc_flags_noprof(...)	_kmalloc_flags_noprof(__VA_ARGS__, __kmalloc_token(__VA_ARGS__))
+#define kmalloc_flags(...)		alloc_hooks(kmalloc_flags_noprof(__VA_ARGS__))
+
 #ifdef CONFIG_64BIT
 # ifdef system_has_cmpxchg128
 # define system_has_freelist_aba()	system_has_cmpxchg128()
diff --git a/mm/slub.c b/mm/slub.c
index 847cad5203b2..cbb38bd01e46 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5386,14 +5386,14 @@ void *__kmalloc_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t flags)
 }
 EXPORT_SYMBOL(__kmalloc_noprof);
 
-void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, int node)
+static void *__kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags,
+				     int node, struct slab_alloc_context *ac)
 {
-	size_t orig_size = size;
-	unsigned int alloc_flags = SLAB_ALLOC_TRYLOCK;
 	struct kmem_cache *s;
 	bool can_retry = true;
 	void *ret;
 
+	VM_WARN_ON_ONCE(alloc_flags_allow_spinning(ac->alloc_flags));
 	VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO |
 			__GFP_NO_OBJ_EXT | __GFP_NOWARN | __GFP_NOMEMALLOC));
 
@@ -5430,23 +5430,17 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 		 */
 		return NULL;
 
-	ret = alloc_from_pcs(s, gfp_flags, alloc_flags, node);
+	ret = alloc_from_pcs(s, gfp_flags, ac->alloc_flags, node);
 	if (ret)
 		goto success;
 
-	struct slab_alloc_context ac = {
-		.caller_addr = _RET_IP_,
-		.orig_size = orig_size,
-		.alloc_flags = alloc_flags,
-	};
-
 	/*
 	 * Do not call slab_alloc_node(), since trylock mode isn't
 	 * compatible with slab_pre_alloc_hook/should_failslab and
 	 * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
 	 * and slab_post_alloc_hook() directly.
 	 */
-	ret = __slab_alloc_node(s, gfp_flags, node, &ac);
+	ret = __slab_alloc_node(s, gfp_flags, node, ac);
 
 	/*
 	 * It's possible we failed due to trylock as we preempted someone with
@@ -5469,11 +5463,23 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 
 success:
 	maybe_wipe_obj_freeptr(s, ret);
-	slab_post_alloc_hook(s, gfp_flags, 1, &ret, &ac);
+	slab_post_alloc_hook(s, gfp_flags, 1, &ret, ac);
 
-	ret = kasan_kmalloc(s, ret, orig_size, gfp_flags);
+	ret = kasan_kmalloc(s, ret, ac->orig_size, gfp_flags);
 	return ret;
 }
+
+void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, int node)
+{
+	struct slab_alloc_context ac = {
+		.caller_addr = _RET_IP_,
+		.orig_size = size,
+		.alloc_flags = SLAB_ALLOC_TRYLOCK,
+	};
+
+	return __kmalloc_nolock_noprof(PASS_TOKEN_PARAMS(size, token),
+				       gfp_flags, node, &ac);
+}
 EXPORT_SYMBOL_GPL(_kmalloc_nolock_noprof);
 
 void *__kmalloc_node_track_caller_noprof(DECL_KMALLOC_PARAMS(size, b, token), gfp_t flags,
@@ -5527,6 +5533,30 @@ void *__kmalloc_cache_node_noprof(struct kmem_cache *s, gfp_t gfpflags,
 }
 EXPORT_SYMBOL(__kmalloc_cache_node_noprof);
 
+/*
+ * The only version of kmalloc_node() that takes alloc_flags and thus can
+ * determine on its own whether to handle the allocation via kmalloc_nolock() or
+ * normally
+ */
+void *__kmalloc_flags_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t flags,
+			     unsigned int alloc_flags, int node)
+{
+	struct slab_alloc_context ac = {
+		.caller_addr = _RET_IP_,
+		.orig_size = size,
+		.alloc_flags = alloc_flags,
+	};
+
+	if (alloc_flags_allow_spinning(alloc_flags)) {
+		return __do_kmalloc_node(size, NULL, flags, node,
+				PASS_TOKEN_PARAM(token), &ac);
+	} else {
+		return __kmalloc_nolock_noprof(PASS_TOKEN_PARAMS(size, token),
+					       flags, node, &ac);
+	}
+}
+
+
 static noinline void free_to_partial_list(
 	struct kmem_cache *s, struct slab *slab,
 	void *head, void *tail, int bulk_cnt,

-- 
2.54.0
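
For readers skimming the series, here is a minimal userspace sketch of the dispatch that kmalloc_flags() performs in this patch: one entry point that picks the normal or the trylock path from alloc_flags alone. The flag values and alloc_flags_allow_spinning() mirror mm/slab.h. struct alloc_ctx, the path_taken field and the model_* functions are assumptions for illustration, with malloc() standing in for the real allocation paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Flag values as defined in mm/slab.h by this series. */
#define SLAB_ALLOC_DEFAULT	0x00
#define SLAB_ALLOC_TRYLOCK	0x01

enum { PATH_NORMAL = 1, PATH_NOLOCK = 2 };

/* Hypothetical stand-in for struct slab_alloc_context. */
struct alloc_ctx {
	size_t orig_size;
	unsigned int alloc_flags;
	int path_taken;	/* model-only: records which path ran */
};

static bool alloc_flags_allow_spinning(unsigned int alloc_flags)
{
	return !(alloc_flags & SLAB_ALLOC_TRYLOCK);
}

/* Model of the normal path (__do_kmalloc_node() in the patch). */
static void *model_normal_alloc(size_t size, struct alloc_ctx *ac)
{
	ac->path_taken = PATH_NORMAL;
	return malloc(size);
}

/* Model of the trylock path (__kmalloc_nolock_noprof() in the patch). */
static void *model_nolock_alloc(size_t size, struct alloc_ctx *ac)
{
	/* Same idea as the VM_WARN_ON_ONCE() added by the patch. */
	assert(!alloc_flags_allow_spinning(ac->alloc_flags));
	ac->path_taken = PATH_NOLOCK;
	return malloc(size);
}

/* Model of kmalloc_flags(): the path is chosen by alloc_flags only. */
void *model_kmalloc_flags(size_t size, unsigned int alloc_flags,
			  int *path_taken)
{
	struct alloc_ctx ac = {
		.orig_size = size,
		.alloc_flags = alloc_flags,
	};
	void *ret;

	if (alloc_flags_allow_spinning(alloc_flags))
		ret = model_normal_alloc(size, &ac);
	else
		ret = model_nolock_alloc(size, &ac);

	*path_taken = ac.path_taken;
	return ret;
}
```

The point of the design is that callers such as alloc_slab_obj_exts() no longer need their own if/else on allow_spin. They forward the alloc_flags they were given.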



* [PATCH v2 13/16] mm/slab: allow __GFP_NOMEMALLOC and __GFP_NOWARN for kmalloc_nolock()
From: Vlastimil Babka (SUSE) @ 2026-06-10 15:40 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260610-slab_alloc_flags-v2-0-7190909db118@kernel.org>

The two flags are added internally, so there's no point in warning if
the caller passes them as well. Allow them. This will allow
simplifying obj_ext allocation under kmalloc_nolock().

Also it's not necessary to have the extra alloc_gfp variable for adding
the two flags. The original gfp_flags parameter is not used anywhere
except for the warning. So remove alloc_gfp and directly modify and use
gfp_flags everywhere.

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 include/linux/slab.h |  3 ++-
 mm/slub.c            | 19 ++++++++++---------
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index ce1c867dc0ba..b955f3cbb732 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -1040,7 +1040,8 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
  * kmalloc_nolock - Allocate an object of given size from any context.
  * @size: size to allocate
  * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO, __GFP_NO_OBJ_EXT
- * allowed.
+ * allowed. Also __GFP_NOWARN and __GFP_NOMEMALLOC are allowed but added
+ * internally thus not necessary.
  * @node: node number of the target node.
  *
  * Return: pointer to the new object or NULL in case of error.
diff --git a/mm/slub.c b/mm/slub.c
index 6845e15c148a..847cad5203b2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5388,7 +5388,6 @@ EXPORT_SYMBOL(__kmalloc_noprof);
 
 void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, int node)
 {
-	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
 	size_t orig_size = size;
 	unsigned int alloc_flags = SLAB_ALLOC_TRYLOCK;
 	struct kmem_cache *s;
@@ -5396,7 +5395,9 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 	void *ret;
 
 	VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO |
-				      __GFP_NO_OBJ_EXT));
+			__GFP_NO_OBJ_EXT | __GFP_NOWARN | __GFP_NOMEMALLOC));
+
+	gfp_flags |= __GFP_NOWARN | __GFP_NOMEMALLOC;
 
 	if (unlikely(!size))
 		return ZERO_SIZE_PTR;
@@ -5415,7 +5416,7 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 retry:
 	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
 		return NULL;
-	s = kmalloc_slab(size, NULL, alloc_gfp, PASS_TOKEN_PARAM(token));
+	s = kmalloc_slab(size, NULL, gfp_flags, PASS_TOKEN_PARAM(token));
 
 	if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
 		/*
@@ -5429,7 +5430,7 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 		 */
 		return NULL;
 
-	ret = alloc_from_pcs(s, alloc_gfp, alloc_flags, node);
+	ret = alloc_from_pcs(s, gfp_flags, alloc_flags, node);
 	if (ret)
 		goto success;
 
@@ -5445,7 +5446,7 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 	 * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
 	 * and slab_post_alloc_hook() directly.
 	 */
-	ret = __slab_alloc_node(s, alloc_gfp, node, &ac);
+	ret = __slab_alloc_node(s, gfp_flags, node, &ac);
 
 	/*
 	 * It's possible we failed due to trylock as we preempted someone with
@@ -5458,8 +5459,8 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 		size = s->object_size + 1;
 		/*
 		 * Another alternative is to
-		 * if (memcg) alloc_gfp &= ~__GFP_ACCOUNT;
-		 * else if (!memcg) alloc_gfp |= __GFP_ACCOUNT;
+		 * if (memcg) gfp_flags &= ~__GFP_ACCOUNT;
+		 * else if (!memcg) gfp_flags |= __GFP_ACCOUNT;
 		 * to retry from bucket of the same size.
 		 */
 		can_retry = false;
@@ -5468,9 +5469,9 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 
 success:
 	maybe_wipe_obj_freeptr(s, ret);
-	slab_post_alloc_hook(s, alloc_gfp, 1, &ret, &ac);
+	slab_post_alloc_hook(s, gfp_flags, 1, &ret, &ac);
 
-	ret = kasan_kmalloc(s, ret, orig_size, alloc_gfp);
+	ret = kasan_kmalloc(s, ret, orig_size, gfp_flags);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(_kmalloc_nolock_noprof);

-- 
2.54.0
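
To make the effect of the widened mask concrete, here is a small userspace sketch of the check and of the internal flag addition. The bit values are hypothetical (the real __GFP_* values depend on the kernel configuration), and the M_* and model_* names exist only in this example.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int model_gfp_t;

/* Hypothetical bit values; the real __GFP_* bits are config dependent. */
#define M_GFP_ACCOUNT		0x01u
#define M_GFP_ZERO		0x02u
#define M_GFP_NO_OBJ_EXT	0x04u
#define M_GFP_NOWARN		0x08u
#define M_GFP_NOMEMALLOC	0x10u
#define M_GFP_DIRECT_RECLAIM	0x20u	/* example of a disallowed flag */

/* The allowed set after this patch. */
#define M_NOLOCK_ALLOWED	(M_GFP_ACCOUNT | M_GFP_ZERO | M_GFP_NO_OBJ_EXT | \
				 M_GFP_NOWARN | M_GFP_NOMEMALLOC)

/* True when the caller passed a flag outside the allowed set. */
bool model_nolock_would_warn(model_gfp_t gfp_flags)
{
	return (gfp_flags & ~M_NOLOCK_ALLOWED) != 0;
}

/* The flags actually used: the two internal flags are always added. */
model_gfp_t model_nolock_effective_gfp(model_gfp_t gfp_flags)
{
	return gfp_flags | M_GFP_NOWARN | M_GFP_NOMEMALLOC;
}
```

Passing the two internal flags explicitly is harmless in this model: the effective flags are identical either way, and only a flag outside the allowed set trips the check.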



* [PATCH v2 12/16] mm/slab: pass slab_alloc_context to __do_kmalloc_node()
From: Vlastimil Babka (SUSE) @ 2026-06-10 15:40 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260610-slab_alloc_flags-v2-0-7190909db118@kernel.org>

With alloc_flags usage in slab, we can replace __GFP_NO_OBJ_EXT with an
alloc flag that prevents kmalloc recursion. For that we need a version
of kmalloc() that takes alloc_flags and use it in places that perform
these potentially recursive kmalloc allocations (of sheaves or obj_ext
arrays).

As a preparatory step, make __do_kmalloc_node() take a pointer to
slab_alloc_context. This replaces the 'caller' parameter and includes
alloc_flags which we'll make use of.

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/slub.c | 47 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 32 insertions(+), 15 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index ef457e07db83..6845e15c148a 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5338,19 +5338,14 @@ EXPORT_SYMBOL(__kmalloc_large_node_noprof);
 
 static __always_inline
 void *__do_kmalloc_node(size_t size, kmem_buckets *b, gfp_t flags, int node,
-			unsigned long caller, kmalloc_token_t token)
+			kmalloc_token_t token, struct slab_alloc_context *ac)
 {
 	struct kmem_cache *s;
 	void *ret;
-	struct slab_alloc_context ac = {
-		.caller_addr = caller,
-		.orig_size = size,
-		.alloc_flags = SLAB_ALLOC_DEFAULT,
-	};
 
 	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE)) {
 		ret = __kmalloc_large_node_noprof(size, flags, node);
-		trace_kmalloc(caller, ret, size,
+		trace_kmalloc(ac->caller_addr, ret, size,
 			      PAGE_SIZE << get_order(size), flags, node);
 		return ret;
 	}
@@ -5360,22 +5355,34 @@ void *__do_kmalloc_node(size_t size, kmem_buckets *b, gfp_t flags, int node,
 
 	s = kmalloc_slab(size, b, flags, token);
 
-	ret = slab_alloc_node(s, flags, node, &ac);
+	ret = slab_alloc_node(s, flags, node, ac);
 	ret = kasan_kmalloc(s, ret, size, flags);
-	trace_kmalloc(caller, ret, size, s->size, flags, node);
+	trace_kmalloc(ac->caller_addr, ret, size, s->size, flags, node);
 	return ret;
 }
 void *__kmalloc_node_noprof(DECL_KMALLOC_PARAMS(size, b, token), gfp_t flags, int node)
 {
+	struct slab_alloc_context ac = {
+		.caller_addr = _RET_IP_,
+		.orig_size = size,
+		.alloc_flags = SLAB_ALLOC_DEFAULT,
+	};
+
 	return __do_kmalloc_node(size, PASS_BUCKET_PARAM(b), flags, node,
-				 _RET_IP_, PASS_TOKEN_PARAM(token));
+				 PASS_TOKEN_PARAM(token), &ac);
 }
 EXPORT_SYMBOL(__kmalloc_node_noprof);
 
 void *__kmalloc_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t flags)
 {
-	return __do_kmalloc_node(size, NULL, flags,  NUMA_NO_NODE, _RET_IP_,
-				 PASS_TOKEN_PARAM(token));
+	struct slab_alloc_context ac = {
+		.caller_addr = _RET_IP_,
+		.orig_size = size,
+		.alloc_flags = SLAB_ALLOC_DEFAULT,
+	};
+
+	return __do_kmalloc_node(size, NULL, flags,  NUMA_NO_NODE,
+				 PASS_TOKEN_PARAM(token), &ac);
 }
 EXPORT_SYMBOL(__kmalloc_noprof);
 
@@ -5471,9 +5478,14 @@ EXPORT_SYMBOL_GPL(_kmalloc_nolock_noprof);
 void *__kmalloc_node_track_caller_noprof(DECL_KMALLOC_PARAMS(size, b, token), gfp_t flags,
 					 int node, unsigned long caller)
 {
-	return __do_kmalloc_node(size, PASS_BUCKET_PARAM(b), flags, node,
-				 caller, PASS_TOKEN_PARAM(token));
+	struct slab_alloc_context ac = {
+		.caller_addr = caller,
+		.orig_size = size,
+		.alloc_flags = SLAB_ALLOC_DEFAULT,
+	};
 
+	return __do_kmalloc_node(size, PASS_BUCKET_PARAM(b), flags, node,
+				 PASS_TOKEN_PARAM(token), &ac);
 }
 EXPORT_SYMBOL(__kmalloc_node_track_caller_noprof);
 
@@ -6874,6 +6886,11 @@ void *__kvmalloc_node_noprof(DECL_KMALLOC_PARAMS(size, b, token), unsigned long
 {
 	bool allow_block;
 	void *ret;
+	struct slab_alloc_context ac = {
+		.caller_addr = _RET_IP_,
+		.orig_size = size,
+		.alloc_flags = SLAB_ALLOC_DEFAULT,
+	};
 
 	/*
 	 * It doesn't really make sense to fallback to vmalloc for sub page
@@ -6881,7 +6898,7 @@ void *__kvmalloc_node_noprof(DECL_KMALLOC_PARAMS(size, b, token), unsigned long
 	 */
 	ret = __do_kmalloc_node(size, PASS_BUCKET_PARAM(b),
 				kmalloc_gfp_adjust(flags, size),
-				node, _RET_IP_, PASS_TOKEN_PARAM(token));
+				node, PASS_TOKEN_PARAM(token), &ac);
 	if (ret || size <= PAGE_SIZE)
 		return ret;
 

-- 
2.54.0



* [PATCH v2 11/16] mm/slab: allow kmem_cache_alloc_bulk() with any gfp flags
From: Vlastimil Babka (SUSE) @ 2026-06-10 15:40 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260610-slab_alloc_flags-v2-0-7190909db118@kernel.org>

The last user of gfpflags_allow_spinning() in slab is
alloc_from_pcs_bulk(), which is only called from
kmem_cache_alloc_bulk().

It turns out that gfpflags_allow_spinning() is not necessary, because
kmem_cache_alloc_bulk() is only expected to be called from a context that
allows spinning, so simply replace it with 'true'.

With that, we can remove the "@flags must allow spinning" part of the
kernel doc, as there is no more connection to the gfp flags in the slab
implementation.

Also remove a comment in alloc_slab_obj_exts() because there should be
no more false positives possible due to gfp_allowed_mask during early
boot.

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/slub.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 0b9974bfcb24..ef457e07db83 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2171,12 +2171,6 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
 
 	sz = obj_exts_alloc_size(s, slab, gfp);
 
-	/*
-	 * Note that allow_spin may be false during early boot and its
-	 * restricted GFP_BOOT_MASK. Due to kmalloc_nolock() only supporting
-	 * architectures with cmpxchg16b, early obj_exts will be missing for
-	 * very early allocations on those.
-	 */
 	if (unlikely(!allow_spin))
 		vec = kmalloc_nolock(sz, __GFP_ZERO | __GFP_NO_OBJ_EXT,
 				     slab_nid(slab));
@@ -4867,7 +4861,7 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, gfp_t gfp, size_t size,
 		}
 
 		full = barn_replace_empty_sheaf(barn, pcs->main,
-						gfpflags_allow_spinning(gfp));
+						/* allow_spin = */ true);
 
 		if (full) {
 			stat(s, BARN_GET);
@@ -7333,8 +7327,7 @@ static bool __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
  * Allocate @size objects from @s and places them into @p.  @size must be larger
  * than 0.
  *
- * Interrupts must be enabled when calling this function and @flags must allow
- * spinning.
+ * Interrupts must be enabled when calling this function.
  *
  * Unlike alloc_pages_bulk(), this function does not check for already allocated
  * objects in @p, and thus the caller does not need to zero it.

-- 
2.54.0



* [PATCH v2 10/16] mm/slab: replace slab_alloc_node() parameters with slab_alloc_context
From: Vlastimil Babka (SUSE) @ 2026-06-10 15:40 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260610-slab_alloc_flags-v2-0-7190909db118@kernel.org>

The function takes all the parameters that exist as fields in
slab_alloc_context, except alloc_flags. Replace them with a single
pointer.

This moves slab_alloc_context initialization to a number of callers,
which is more verbose, but arguably also clearer than a long list of
parameters, and most callers do not use the 'lru' field.

This will also allow kmalloc_nolock() to call slab_alloc_node() and
reduce the special open-coding it currently has.

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/slub.c | 75 ++++++++++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 53 insertions(+), 22 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index e634137b67fa..0b9974bfcb24 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4921,30 +4921,23 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, gfp_t gfp, size_t size,
  *
  * Otherwise we can simply pick the next object from the lockless free list.
  */
-static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list_lru *lru,
-		gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
+static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s,
+		gfp_t gfpflags, int node, struct slab_alloc_context *ac)
 {
-	const unsigned int alloc_flags = SLAB_ALLOC_DEFAULT;
 	void *object;
-	struct slab_alloc_context ac = {
-		.caller_addr = addr,
-		.orig_size = orig_size,
-		.alloc_flags = alloc_flags,
-		.lru = lru,
-	};
 
 	s = slab_pre_alloc_hook(s, gfpflags);
 	if (unlikely(!s))
 		return NULL;
 
-	object = kfence_alloc(s, orig_size, gfpflags);
+	object = kfence_alloc(s, ac->orig_size, gfpflags);
 	if (unlikely(object))
 		goto out;
 
-	object = alloc_from_pcs(s, gfpflags, alloc_flags, node);
+	object = alloc_from_pcs(s, gfpflags, ac->alloc_flags, node);
 
 	if (!object)
-		object = __slab_alloc_node(s, gfpflags, node, &ac);
+		object = __slab_alloc_node(s, gfpflags, node, ac);
 
 	maybe_wipe_obj_freeptr(s, object);
 
@@ -4953,15 +4946,21 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
 	 * In case this fails due to memcg_slab_post_alloc_hook(),
 	 * object is set to NULL
 	 */
-	slab_post_alloc_hook(s, gfpflags, 1, &object, &ac);
+	slab_post_alloc_hook(s, gfpflags, 1, &object, ac);
 
 	return object;
 }
 
 void *kmem_cache_alloc_noprof(struct kmem_cache *s, gfp_t gfpflags)
 {
-	void *ret = slab_alloc_node(s, NULL, gfpflags, NUMA_NO_NODE, _RET_IP_,
-				    s->object_size);
+	void *ret;
+	struct slab_alloc_context ac = {
+		.caller_addr = _RET_IP_,
+		.orig_size = s->object_size,
+		.alloc_flags = SLAB_ALLOC_DEFAULT,
+	};
+
+	ret = slab_alloc_node(s, gfpflags, NUMA_NO_NODE, &ac);
 
 	trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, NUMA_NO_NODE);
 
@@ -4972,8 +4971,15 @@ EXPORT_SYMBOL(kmem_cache_alloc_noprof);
 void *kmem_cache_alloc_lru_noprof(struct kmem_cache *s, struct list_lru *lru,
 			   gfp_t gfpflags)
 {
-	void *ret = slab_alloc_node(s, lru, gfpflags, NUMA_NO_NODE, _RET_IP_,
-				    s->object_size);
+	void *ret;
+	struct slab_alloc_context ac = {
+		.caller_addr = _RET_IP_,
+		.orig_size = s->object_size,
+		.alloc_flags = SLAB_ALLOC_DEFAULT,
+		.lru = lru,
+	};
+
+	ret = slab_alloc_node(s, gfpflags, NUMA_NO_NODE, &ac);
 
 	trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, NUMA_NO_NODE);
 
@@ -5005,7 +5011,14 @@ EXPORT_SYMBOL(kmem_cache_charge);
  */
 void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t gfpflags, int node)
 {
-	void *ret = slab_alloc_node(s, NULL, gfpflags, node, _RET_IP_, s->object_size);
+	void *ret;
+	struct slab_alloc_context ac = {
+		.caller_addr = _RET_IP_,
+		.orig_size = s->object_size,
+		.alloc_flags = SLAB_ALLOC_DEFAULT,
+	};
+
+	ret = slab_alloc_node(s, gfpflags, node, &ac);
 
 	trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, node);
 
@@ -5335,6 +5348,11 @@ void *__do_kmalloc_node(size_t size, kmem_buckets *b, gfp_t flags, int node,
 {
 	struct kmem_cache *s;
 	void *ret;
+	struct slab_alloc_context ac = {
+		.caller_addr = caller,
+		.orig_size = size,
+		.alloc_flags = SLAB_ALLOC_DEFAULT,
+	};
 
 	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE)) {
 		ret = __kmalloc_large_node_noprof(size, flags, node);
@@ -5348,7 +5366,7 @@ void *__do_kmalloc_node(size_t size, kmem_buckets *b, gfp_t flags, int node,
 
 	s = kmalloc_slab(size, b, flags, token);
 
-	ret = slab_alloc_node(s, NULL, flags, node, caller, size);
+	ret = slab_alloc_node(s, flags, node, &ac);
 	ret = kasan_kmalloc(s, ret, size, flags);
 	trace_kmalloc(caller, ret, size, s->size, flags, node);
 	return ret;
@@ -5467,8 +5485,14 @@ EXPORT_SYMBOL(__kmalloc_node_track_caller_noprof);
 
 void *__kmalloc_cache_noprof(struct kmem_cache *s, gfp_t gfpflags, size_t size)
 {
-	void *ret = slab_alloc_node(s, NULL, gfpflags, NUMA_NO_NODE,
-					    _RET_IP_, size);
+	void *ret;
+	struct slab_alloc_context ac = {
+		.caller_addr = _RET_IP_,
+		.orig_size = size,
+		.alloc_flags = SLAB_ALLOC_DEFAULT,
+	};
+
+	ret = slab_alloc_node(s, gfpflags, NUMA_NO_NODE, &ac);
 
 	trace_kmalloc(_RET_IP_, ret, size, s->size, gfpflags, NUMA_NO_NODE);
 
@@ -5480,7 +5504,14 @@ EXPORT_SYMBOL(__kmalloc_cache_noprof);
 void *__kmalloc_cache_node_noprof(struct kmem_cache *s, gfp_t gfpflags,
 				  int node, size_t size)
 {
-	void *ret = slab_alloc_node(s, NULL, gfpflags, node, _RET_IP_, size);
+	void *ret;
+	struct slab_alloc_context ac = {
+		.caller_addr = _RET_IP_,
+		.orig_size = size,
+		.alloc_flags = SLAB_ALLOC_DEFAULT,
+	};
+
+	ret = slab_alloc_node(s, gfpflags, node, &ac);
 
 	trace_kmalloc(_RET_IP_, ret, size, s->size, gfpflags, node);
 

-- 
2.54.0



* [PATCH v2 09/16] mm/slab: pass alloc_flags through slab_post_alloc_hook() chain
From: Vlastimil Babka (SUSE) @ 2026-06-10 15:40 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260610-slab_alloc_flags-v2-0-7190909db118@kernel.org>

Convert the whole following call stack to pass either slab_alloc_context
(thus including alloc_flags) or just alloc_flags as necessary:

slab_post_alloc_hook()
  alloc_tagging_slab_alloc_hook()
    __alloc_tagging_slab_alloc_hook()
      prepare_slab_obj_exts_hook()
        alloc_slab_obj_exts()
  memcg_slab_post_alloc_hook()
    __memcg_slab_post_alloc_hook()
      alloc_slab_obj_exts()

Converting all these at once avoids unnecessary churn and is mostly
mechanical.

This ultimately allows alloc_slab_obj_exts() and slab_post_alloc_hook()
to decide whether spinning is allowed based on alloc_flags. Aside from
alloc_from_pcs_bulk() (to be handled next), nothing else in slab itself
relies on gfpflags_allow_spinning(), which can return false even when
not called from kmalloc_nolock().

A followup change will also use the alloc_flags availability in the call
stack above to remove the __GFP_NO_OBJ_EXT flag.

For alloc_slab_obj_exts(), also replace the suboptimal "bool new_slab"
parameter with a SLAB_ALLOC_NEW_SLAB flag with identical functionality.
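
The flag replacement described above can be sketched in userspace as
follows. The three flag values are copied from the mm/slab.h hunk in
this patch; the demo_* helpers are hypothetical, and the body of
demo_allow_spinning() is an assumption (the real
alloc_flags_allow_spinning() body is not shown in this patch), namely
that spinning is allowed unless SLAB_ALLOC_TRYLOCK is set.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as defined in mm/slab.h by this series. */
#define SLAB_ALLOC_DEFAULT	0x00 /* no flags */
#define SLAB_ALLOC_TRYLOCK	0x01 /* a kmalloc_nolock() allocation */
#define SLAB_ALLOC_NEW_SLAB	0x02 /* a flag for alloc_slab_obj_exts() */

/* Assumption: only a TRYLOCK (kmalloc_nolock()) allocation forbids spinning. */
static bool demo_allow_spinning(unsigned int alloc_flags)
{
	return !(alloc_flags & SLAB_ALLOC_TRYLOCK);
}

/* Replaces the old "bool new_slab" parameter with a flag test. */
static bool demo_is_new_slab(unsigned int alloc_flags)
{
	return (alloc_flags & SLAB_ALLOC_NEW_SLAB) != 0;
}

/* Mirrors account_slab(): keep the caller's flags and mark the slab as new. */
static unsigned int demo_account_slab_flags(unsigned int alloc_flags)
{
	return alloc_flags | SLAB_ALLOC_NEW_SLAB;
}

/* Hypothetical self-check: the two properties are independent bits. */
static void demo_self_check(void)
{
	unsigned int flags = demo_account_slab_flags(SLAB_ALLOC_TRYLOCK);

	assert(demo_is_new_slab(flags));
	assert(!demo_allow_spinning(flags));
}
```

Because the "new slab" property is just another bit, a single
unsigned int carries both it and the trylock context through the whole
call chain.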

To further reduce the number of parameters of slab_post_alloc_hook(),
also make 'struct list_lru *lru' (which is NULL for most callers) a new
field of slab_alloc_context.
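
As a rough userspace illustration of why moving lru into the context
is cheap for callers: with designated initializers, any field not
named is zero-initialized, so callers that have no lru simply omit it.
The demo_* types and helpers below are hypothetical stand-ins, not the
kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct list_lru. */
struct demo_list_lru {
	int unused;
};

/* Same field layout as slab_alloc_context after this patch. */
struct demo_alloc_context {
	unsigned long caller_addr;
	unsigned long orig_size;
	unsigned int alloc_flags;
	struct demo_list_lru *lru;
};

/* A hook that takes only the context instead of a separate lru argument. */
static int demo_hook_has_lru(const struct demo_alloc_context *ac)
{
	return ac->lru != NULL;
}

/* Typical caller without an lru: the field is left out and becomes NULL. */
static struct demo_alloc_context demo_plain_context(unsigned long size)
{
	struct demo_alloc_context ac = {
		.orig_size = size,
		.alloc_flags = 0,
	};

	return ac;
}

/* Hypothetical self-check of the zero-initialization behaviour. */
static void demo_self_check(void)
{
	struct demo_alloc_context ac = demo_plain_context(64);

	assert(ac.lru == NULL);
	assert(!demo_hook_has_lru(&ac));
}
```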

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/memcontrol.c |  5 +--
 mm/slab.h       |  6 ++--
 mm/slub.c       | 94 +++++++++++++++++++++++++++++++++------------------------
 3 files changed, 62 insertions(+), 43 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c03d4787d466..29390ba13baa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3424,7 +3424,8 @@ static inline size_t obj_full_size(struct kmem_cache *s)
 }
 
 bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
-				  gfp_t flags, size_t size, void **p)
+				  gfp_t flags, unsigned int slab_alloc_flags,
+				  size_t size, void **p)
 {
 	size_t obj_size = obj_full_size(s);
 	struct obj_cgroup *objcg;
@@ -3472,7 +3473,7 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 		slab = virt_to_slab(p[i]);
 
 		if (!slab_obj_exts(slab) &&
-		    alloc_slab_obj_exts(slab, s, flags, false)) {
+		    alloc_slab_obj_exts(slab, s, flags, slab_alloc_flags)) {
 			continue;
 		}
 
diff --git a/mm/slab.h b/mm/slab.h
index 96f65b625600..4db6d8aa0ee3 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -19,6 +19,7 @@
 /* slab's alloc_flags definitions */
 #define SLAB_ALLOC_DEFAULT	0x00 /* no flags */
 #define SLAB_ALLOC_TRYLOCK	0x01 /* a kmalloc_nolock() allocation */
+#define SLAB_ALLOC_NEW_SLAB	0x02 /* a flag for alloc_slab_obj_exts() */
 
 static inline bool alloc_flags_allow_spinning(const unsigned int alloc_flags)
 {
@@ -612,7 +613,7 @@ static inline struct slabobj_ext *slab_obj_ext(struct slab *slab,
 }
 
 int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
-                        gfp_t gfp, bool new_slab);
+			gfp_t gfp, unsigned int alloc_flags);
 
 #else /* CONFIG_SLAB_OBJ_EXT */
 
@@ -642,7 +643,8 @@ static inline enum node_stat_item cache_vmstat_idx(struct kmem_cache *s)
 
 #ifdef CONFIG_MEMCG
 bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
-				  gfp_t flags, size_t size, void **p);
+				  gfp_t flags, unsigned int slab_alloc_flags,
+				  size_t size, void **p);
 void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
 			    void **p, int objects, unsigned long obj_exts);
 #endif
diff --git a/mm/slub.c b/mm/slub.c
index 8f6ca3d5fdfa..e634137b67fa 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -218,6 +218,7 @@ struct slab_alloc_context {
 	unsigned long caller_addr;
 	unsigned long orig_size;
 	unsigned int alloc_flags;
+	struct list_lru *lru;
 };
 
 /* Structure holding parameters for get_partial_node_bulk() */
@@ -2155,9 +2156,9 @@ static inline size_t obj_exts_alloc_size(struct kmem_cache *s,
 }
 
 int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
-		        gfp_t gfp, bool new_slab)
+			gfp_t gfp, unsigned int alloc_flags)
 {
-	bool allow_spin = gfpflags_allow_spinning(gfp);
+	const bool allow_spin = alloc_flags_allow_spinning(alloc_flags);
 	unsigned int objects = objs_per_slab(s, slab);
 	unsigned long new_exts;
 	unsigned long old_exts;
@@ -2206,7 +2207,7 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
 	old_exts = READ_ONCE(slab->obj_exts);
 	handle_failed_objexts_alloc(old_exts, vec, objects);
 
-	if (new_slab) {
+	if (alloc_flags & SLAB_ALLOC_NEW_SLAB) {
 		/*
 		 * If the slab is brand new and nobody can yet access its
 		 * obj_exts, no synchronization is required and obj_exts can
@@ -2331,7 +2332,7 @@ static inline void init_slab_obj_exts(struct slab *slab)
 }
 
 static int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
-			       gfp_t gfp, bool new_slab)
+			       gfp_t gfp, unsigned int alloc_flags)
 {
 	return 0;
 }
@@ -2351,10 +2352,10 @@ static inline void alloc_slab_obj_exts_early(struct kmem_cache *s,
 
 static inline unsigned long
 prepare_slab_obj_exts_hook(struct kmem_cache *s, struct slab *slab,
-			   gfp_t flags, void *p)
+			   gfp_t flags, unsigned int alloc_flags, void *p)
 {
 	if (!slab_obj_exts(slab) &&
-	    alloc_slab_obj_exts(slab, s, flags, false)) {
+	    alloc_slab_obj_exts(slab, s, flags, alloc_flags)) {
 		pr_warn_once("%s, %s: Failed to create slab extension vector!\n",
 			     __func__, s->name);
 		return 0;
@@ -2366,7 +2367,8 @@ prepare_slab_obj_exts_hook(struct kmem_cache *s, struct slab *slab,
 
 /* Should be called only if mem_alloc_profiling_enabled() */
 static noinline void
-__alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags)
+__alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags,
+				unsigned int alloc_flags)
 {
 	unsigned long obj_exts;
 	struct slabobj_ext *obj_ext;
@@ -2382,7 +2384,7 @@ __alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags)
 		return;
 
 	slab = virt_to_slab(object);
-	obj_exts = prepare_slab_obj_exts_hook(s, slab, flags, object);
+	obj_exts = prepare_slab_obj_exts_hook(s, slab, flags, alloc_flags, object);
 	/*
 	 * Currently obj_exts is used only for allocation profiling.
 	 * If other users appear then mem_alloc_profiling_enabled()
@@ -2401,10 +2403,11 @@ __alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags)
 }
 
 static inline void
-alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags)
+alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags,
+			      unsigned int alloc_flags)
 {
 	if (mem_alloc_profiling_enabled())
-		__alloc_tagging_slab_alloc_hook(s, object, flags);
+		__alloc_tagging_slab_alloc_hook(s, object, flags, alloc_flags);
 }
 
 /* Should be called only if mem_alloc_profiling_enabled() */
@@ -2443,7 +2446,8 @@ alloc_tagging_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p,
 #else /* CONFIG_MEM_ALLOC_PROFILING */
 
 static inline void
-alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags)
+alloc_tagging_slab_alloc_hook(struct kmem_cache *s, void *object, gfp_t flags,
+			      unsigned int alloc_flags)
 {
 }
 
@@ -2461,8 +2465,9 @@ alloc_tagging_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p,
 static void memcg_alloc_abort_single(struct kmem_cache *s, void *object);
 
 static __fastpath_inline
-bool memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
-				gfp_t flags, size_t size, void **p)
+bool memcg_slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
+				size_t size, void **p,
+				struct slab_alloc_context *ac)
 {
 	if (likely(!memcg_kmem_online()))
 		return true;
@@ -2470,7 +2475,8 @@ bool memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 	if (likely(!(flags & __GFP_ACCOUNT) && !(s->flags & SLAB_ACCOUNT)))
 		return true;
 
-	if (likely(__memcg_slab_post_alloc_hook(s, lru, flags, size, p)))
+	if (likely(__memcg_slab_post_alloc_hook(s, ac->lru, flags,
+						ac->alloc_flags, size, p)))
 		return true;
 
 	if (likely(size == 1)) {
@@ -2558,14 +2564,15 @@ bool memcg_slab_post_charge(void *p, gfp_t flags)
 		put_slab_obj_exts(obj_exts);
 	}
 
-	return __memcg_slab_post_alloc_hook(s, NULL, flags, 1, &p);
+	return __memcg_slab_post_alloc_hook(s, NULL, flags, SLAB_ALLOC_DEFAULT,
+					    1, &p);
 }
 
 #else /* CONFIG_MEMCG */
 static inline bool memcg_slab_post_alloc_hook(struct kmem_cache *s,
-					      struct list_lru *lru,
-					      gfp_t flags, size_t size,
-					      void **p)
+					      gfp_t flags,
+					      size_t size, void **p,
+					      struct slab_alloc_context *ac)
 {
 	return true;
 }
@@ -3352,12 +3359,14 @@ static inline void init_freelist_randomization(void) { }
 #endif /* CONFIG_SLAB_FREELIST_RANDOM */
 
 static __always_inline void account_slab(struct slab *slab, int order,
-					 struct kmem_cache *s, gfp_t gfp)
+					 struct kmem_cache *s, gfp_t gfp,
+					 unsigned int alloc_flags)
 {
 	if (memcg_kmem_online() &&
 			(s->flags & SLAB_ACCOUNT) &&
 			!slab_obj_exts(slab))
-		alloc_slab_obj_exts(slab, s, gfp, true);
+		alloc_slab_obj_exts(slab, s, gfp,
+				    alloc_flags | SLAB_ALLOC_NEW_SLAB);
 
 	mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s),
 			    PAGE_SIZE << order);
@@ -3434,7 +3443,7 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags,
 	 * to prevent the array from being overwritten.
 	 */
 	alloc_slab_obj_exts_early(s, slab);
-	account_slab(slab, oo_order(oo), s, flags);
+	account_slab(slab, oo_order(oo), s, flags, alloc_flags);
 
 	return slab;
 }
@@ -4568,9 +4577,8 @@ struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
 }
 
 static __fastpath_inline
-bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
-			  gfp_t flags, size_t size, void **p,
-			  unsigned int orig_size)
+bool slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags, size_t size,
+			  void **p, struct slab_alloc_context *ac)
 {
 	bool init = slab_want_init_on_alloc(flags, s);
 	unsigned int zero_size = s->object_size;
@@ -4590,7 +4598,7 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 	 * orig_size if we track it.
 	 */
 	if (slub_debug_orig_size(s))
-		zero_size = orig_size;
+		zero_size = ac->orig_size;
 
 	/*
 	 * When slab_debug is enabled, avoid memory initialization integrated
@@ -4616,14 +4624,14 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 				     !kasan_has_integrated_init())
 				 && !is_kfence_address(p[i]))
 			memset(p[i], 0, zero_size);
-		if (gfpflags_allow_spinning(flags))
+		if (alloc_flags_allow_spinning(ac->alloc_flags))
 			kmemleak_alloc_recursive(p[i], s->object_size, 1,
 						 s->flags, init_flags);
 		kmsan_slab_alloc(s, p[i], init_flags);
-		alloc_tagging_slab_alloc_hook(s, p[i], flags);
+		alloc_tagging_slab_alloc_hook(s, p[i], flags, ac->alloc_flags);
 	}
 
-	return memcg_slab_post_alloc_hook(s, lru, flags, size, p);
+	return memcg_slab_post_alloc_hook(s, flags, size, p, ac);
 }
 
 /*
@@ -4918,6 +4926,12 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
 {
 	const unsigned int alloc_flags = SLAB_ALLOC_DEFAULT;
 	void *object;
+	struct slab_alloc_context ac = {
+		.caller_addr = addr,
+		.orig_size = orig_size,
+		.alloc_flags = alloc_flags,
+		.lru = lru,
+	};
 
 	s = slab_pre_alloc_hook(s, gfpflags);
 	if (unlikely(!s))
@@ -4929,14 +4943,8 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
 
 	object = alloc_from_pcs(s, gfpflags, alloc_flags, node);
 
-	if (unlikely(!object)) {
-		struct slab_alloc_context ac = {
-			.caller_addr = addr,
-			.orig_size = orig_size,
-			.alloc_flags = alloc_flags,
-		};
+	if (!object)
 		object = __slab_alloc_node(s, gfpflags, node, &ac);
-	}
 
 	maybe_wipe_obj_freeptr(s, object);
 
@@ -4945,7 +4953,7 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
 	 * In case this fails due to memcg_slab_post_alloc_hook(),
 	 * object is set to NULL
 	 */
-	slab_post_alloc_hook(s, lru, gfpflags, 1, &object, orig_size);
+	slab_post_alloc_hook(s, gfpflags, 1, &object, &ac);
 
 	return object;
 }
@@ -5240,6 +5248,10 @@ kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
 				   struct slab_sheaf *sheaf)
 {
 	void *ret = NULL;
+	struct slab_alloc_context ac = {
+		.orig_size = s->object_size,
+		.alloc_flags = SLAB_ALLOC_DEFAULT,
+	};
 
 	if (sheaf->size == 0)
 		goto out;
@@ -5250,7 +5262,7 @@ kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
 		ret = sheaf->objects[--sheaf->size];
 
 	/* add __GFP_NOFAIL to force successful memcg charging */
-	slab_post_alloc_hook(s, NULL, gfp | __GFP_NOFAIL, 1, &ret, s->object_size);
+	slab_post_alloc_hook(s, gfp | __GFP_NOFAIL, 1, &ret, &ac);
 out:
 	trace_kmem_cache_alloc(_RET_IP_, ret, s, gfp, NUMA_NO_NODE);
 
@@ -5437,7 +5449,7 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
 
 success:
 	maybe_wipe_obj_freeptr(s, ret);
-	slab_post_alloc_hook(s, NULL, alloc_gfp, 1, &ret, orig_size);
+	slab_post_alloc_hook(s, alloc_gfp, 1, &ret, &ac);
 
 	ret = kasan_kmalloc(s, ret, orig_size, alloc_gfp);
 	return ret;
@@ -7303,6 +7315,10 @@ bool kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags,
 {
 	unsigned int i = 0;
 	void *kfence_obj;
+	struct slab_alloc_context ac = {
+		.orig_size = s->object_size,
+		.alloc_flags = SLAB_ALLOC_DEFAULT,
+	};
 
 	if (!size)
 		return false;
@@ -7353,7 +7369,7 @@ bool kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags,
 
 out:
 	/* memcg and kmem_cache debug support and memory initialization */
-	return likely(slab_post_alloc_hook(s, NULL, flags, size, p, s->object_size));
+	return likely(slab_post_alloc_hook(s, flags, size, p, &ac));
 }
 EXPORT_SYMBOL(kmem_cache_alloc_bulk_noprof);
 

-- 
2.54.0



* [PATCH v2 08/16] mm/slab: pass alloc_flags to new slab allocation
From: Vlastimil Babka (SUSE) @ 2026-06-10 15:40 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups, Vlastimil Babka (SUSE)
In-Reply-To: <20260610-slab_alloc_flags-v2-0-7190909db118@kernel.org>

Add the alloc_flags parameter to allocate_slab() and new_slab()
so it can be used to determine if spinning is allowed, independently
from gfp flags.

refill_objects() passes SLAB_ALLOC_DEFAULT because it can only be
reached from contexts that allow spinning.

Also change how trynode_flags are constructed in ___slab_alloc():
use masking instead of a branch to achieve the same "do not upgrade to
GFP_NOWAIT" behavior. As a result, gfp flags weaker than GFP_NOWAIT
(i.e. lacking __GFP_KSWAPD_RECLAIM) are no longer upgraded even when the
allocation doesn't come from kmalloc_nolock(), which is more correct
anyway.

The masking also preserves any existing __GFP_NOMEMALLOC (pointed out by
Sashiko) and __GFP_ACCOUNT. Previously the hardcoded GFP_NOWAIT would
drop them, but that is not a big enough problem to need a separate fix.
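
The masking can be modelled in userspace as below. This is only a
sketch: the DEMO_* bit values are made up for illustration and are not
the kernel's real gfp bits, GFP_NOWAIT is modelled as
__GFP_KSWAPD_RECLAIM | __GFP_NOWARN, and trynode_flags is assumed to
start out equal to gfpflags.

```c
#include <assert.h>

/* Illustrative bit values only; the kernel's real gfp bits differ. */
#define DEMO_GFP_DIRECT_RECLAIM	0x01u
#define DEMO_GFP_KSWAPD_RECLAIM	0x02u
#define DEMO_GFP_NOWARN		0x04u
#define DEMO_GFP_NOMEMALLOC	0x08u
#define DEMO_GFP_ACCOUNT	0x10u
#define DEMO_GFP_THISNODE	0x20u
#define DEMO_GFP_NOWAIT		(DEMO_GFP_KSWAPD_RECLAIM | DEMO_GFP_NOWARN)

/* Same two statements as the new ___slab_alloc() code, on demo bits. */
static unsigned int demo_trynode_flags(unsigned int gfpflags)
{
	unsigned int trynode_flags = gfpflags;

	/* Drop everything stronger than NOWAIT; keep NOMEMALLOC and ACCOUNT. */
	trynode_flags &= DEMO_GFP_NOWAIT | DEMO_GFP_NOMEMALLOC | DEMO_GFP_ACCOUNT;
	/* Always stay quiet and stay on the requested node. */
	trynode_flags |= DEMO_GFP_NOWARN | DEMO_GFP_THISNODE;

	return trynode_flags;
}

/* Hypothetical self-check: masking can only remove reclaim bits, never add. */
static void demo_self_check(void)
{
	/* No reclaim bits in, no reclaim bits out. */
	assert(!(demo_trynode_flags(0) & DEMO_GFP_KSWAPD_RECLAIM));
	/* Direct reclaim is always stripped. */
	assert(!(demo_trynode_flags(DEMO_GFP_DIRECT_RECLAIM) &
		 DEMO_GFP_DIRECT_RECLAIM));
}
```

Since the AND can only clear bits, a caller that did not ask for kswapd
reclaim never gains it, which is the "no upgrade" property without
needing the allow_spin branch.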

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/slub.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 98b79e5e7679..8f6ca3d5fdfa 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3378,9 +3378,10 @@ static __always_inline void unaccount_slab(struct slab *slab, int order,
 }
 
 /* Allocate and initialize a slab without building its freelist. */
-static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags,
+				  unsigned int alloc_flags, int node)
 {
-	bool allow_spin = gfpflags_allow_spinning(flags);
+	bool allow_spin = alloc_flags_allow_spinning(alloc_flags);
 	struct slab *slab;
 	struct kmem_cache_order_objects oo = s->oo;
 	gfp_t alloc_gfp;
@@ -3438,15 +3439,17 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	return slab;
 }
 
-static struct slab *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct slab *new_slab(struct kmem_cache *s, gfp_t flags,
+			     unsigned int alloc_flags, int node)
 {
 	if (unlikely(flags & GFP_SLAB_BUG_MASK))
 		flags = kmalloc_fix_flags(flags);
 
 	WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
 
-	return allocate_slab(s,
-		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
+	flags &= GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK;
+
+	return allocate_slab(s, flags, alloc_flags, node);
 }
 
 static void __free_slab(struct kmem_cache *s, struct slab *slab, bool allow_spin)
@@ -4467,25 +4470,22 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	 * 1) try to get a partial slab from target node only by having
 	 *    __GFP_THISNODE in pc.flags for get_from_partial()
 	 * 2) if 1) failed, try to allocate a new slab from target node with
-	 *    GPF_NOWAIT | __GFP_THISNODE opportunistically
+	 *    (at most) GFP_NOWAIT | __GFP_THISNODE opportunistically
 	 * 3) if 2) failed, retry with original gfpflags which will allow
 	 *    get_from_partial() try partial lists of other nodes before
 	 *    potentially allocating new page from other nodes
 	 */
 	if (unlikely(node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE)
 		     && try_thisnode)) {
-		if (unlikely(!allow_spin))
-			/* Do not upgrade gfp to NOWAIT from more restrictive mode */
-			trynode_flags = gfpflags | __GFP_THISNODE;
-		else
-			trynode_flags = GFP_NOWAIT | __GFP_THISNODE;
+		trynode_flags &= GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_ACCOUNT;
+		trynode_flags |= __GFP_NOWARN | __GFP_THISNODE;
 	}
 
 	object = get_from_partial(s, node, trynode_flags, ac);
 	if (object)
 		goto success;
 
-	slab = new_slab(s, trynode_flags, node);
+	slab = new_slab(s, trynode_flags, ac->alloc_flags, node);
 
 	if (unlikely(!slab)) {
 		if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE)
@@ -7231,7 +7231,7 @@ refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
 
 new_slab:
 
-	slab = new_slab(s, gfp, local_node);
+	slab = new_slab(s, gfp, SLAB_ALLOC_DEFAULT, local_node);
 	if (!slab)
 		goto out;
 
@@ -7579,7 +7579,7 @@ static void early_kmem_cache_node_alloc(int node)
 
 	BUG_ON(kmem_cache_node->size < sizeof(struct kmem_cache_node));
 
-	slab = new_slab(kmem_cache_node, GFP_NOWAIT, node);
+	slab = new_slab(kmem_cache_node, GFP_NOWAIT, SLAB_ALLOC_DEFAULT, node);
 
 	BUG_ON(!slab);
 	if (slab_nid(slab) != node) {

-- 
2.54.0

