Linux-mm Archive on lore.kernel.org
* [RFC] __GFP_UNMAPPED and __GFP_PRIVATE follow up
@ 2026-05-14 17:42 Gregory Price
  2026-05-15  9:43 ` Brendan Jackman
  0 siblings, 1 reply; 4+ messages in thread
From: Gregory Price @ 2026-05-14 17:42 UTC (permalink / raw)
  To: linux-mm, jackmanb
  Cc: kernel-team, vishal.l.verma, ira.weiny, dan.j.williams, longman,
	akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
	mhocko, osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, sj, baolin.wang, npache, ryan.roberts, dev.jain, baohua,
	lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
	linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
	harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng, nphamcs,
	bhe, zhengqi.arch, terry.bowman

I'm sending this as a general follow-up to the __GFP_UNMAPPED and
__GFP_PRIVATE proposals that were discussed at LSFMMBPF '26.

__GFP_PRIVATE
https://lore.kernel.org/linux-mm/20260222084842.1824063-3-gourry@gourry.net/
__GFP_UNMAPPED
https://lore.kernel.org/linux-mm/20260320-page_alloc-unmapped-v2-0-28bf1bd54f41@google.com/

There is a general push to avoid new GFP flags, and there were common
questions about alloc_context.


I have an idea for that, but first, let me address something about
__GFP_PRIVATE.

For __GFP_PRIVATE there was a question about whether the global nodemask
interfaces could be fixed. I've taken a bit of time to look at this and
I'm again left saying: Not without completely reinventing the wheel.

In particular, there's nothing that prevents an N_MEMORY_PRIVATE node
from also being N_CPU or N_GENERIC_INITIATOR.

In addition, there are a few hundred instances across the kernel of
nodemasks being cobbled together from node_states[] masks and stuff
like remap operations that may result in a private node finding its
way into a nodemask.

This kind of pattern isn't going away, and node_states have UAPI
implications associated with them :[.

The reality is that we need to make the allocation request explicit
via some argument to the allocator if we want to re-use that code.
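
To make the leak concrete, here is a minimal userspace sketch (not kernel
code). Every toy_* name is a hypothetical stand-in: toy_node_states[]
plays the role of node_states[], and the allow_private argument plays the
role of the explicit allocator argument. It only shows that a mask built
from an unrelated state such as N_CPU can contain a private node, and that
an explicit opt-in lets the allocator filter it regardless of how the mask
was built.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for a few node_states[] indices. */
enum toy_node_state {
    TOY_N_MEMORY,
    TOY_N_CPU,
    TOY_N_MEMORY_PRIVATE,
    TOY_NR_STATES,
};

typedef uint64_t toy_nodemask_t; /* one bit per node id, at most 64 nodes */

static toy_nodemask_t toy_node_states[TOY_NR_STATES];

static void toy_node_set_state(unsigned int nid, enum toy_node_state st)
{
    assert(nid < 64);
    toy_node_states[st] |= (toy_nodemask_t)1 << nid;
}

/* A caller building "all nodes with CPUs" straight from the state mask. */
static toy_nodemask_t toy_cpu_nodes(void)
{
    return toy_node_states[TOY_N_CPU];
}

/*
 * Allocator-side filter: private nodes are dropped from whatever mask
 * arrives unless the request explicitly opted in.
 */
static toy_nodemask_t toy_effective_mask(toy_nodemask_t requested,
                                         bool allow_private)
{
    if (allow_private)
        return requested;
    return requested & ~toy_node_states[TOY_N_MEMORY_PRIVATE];
}
```

With node 1 marked both TOY_N_CPU and TOY_N_MEMORY_PRIVATE, the mask from
toy_cpu_nodes() contains node 1, and only the opt-in request keeps it.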


Yesterday I spitballed the addition of a new alloc interface:
https://lore.kernel.org/linux-mm/agS76pNPlPVLgpFA@gourry-fedora-PF4VCD3F/

I cannot speak for Brendan, however, in his cover letter he said:
https://lore.kernel.org/linux-mm/20260320-page_alloc-unmapped-v2-0-28bf1bd54f41@google.com/

  For now I still assume a GFP flag is the cleanest way to get that
  but in principle I'm not opposed to alloc_unmapped_pages() ...


His proposal looks a lot like ALLOC_CMA, in my opinion.

I'm wondering if we can solve both of these with an alloc_context
extension.  In fact, I'm wondering if some GFP flags should actually
be alloc flags anyway.

We have more flexibility with alloc_flags (for now) because they're only
defined in mm/internal.h.

Maybe we could modify alloc_flags to be a struct, and export that
without being tied down to a 32/64-bit flag field - and mark certain
sets of alloc flags verboten (internally controlled / controlled by GFP
flags, and will either be ignored or cause a BUG()).

Then we could get something like:

    struct alloc_flags {
        /*
         * internal only: will be ignored, cleared, or cause BUG() if used,
         * or should be applied via the appropriate __GFP flag.
         */
        uint64_t wmark_min : 1;
        uint64_t wmark_low : 1;
        uint64_t wmark_high : 1;
        ... etc ...
        /*
         * external context flags
         * allows explicit access to certain resources
         */
        uint64_t cma          : 1; /* allows access to CMA regions */
        uint64_t unmapped     : 1; /* return pages in unmapped state */
        uint64_t managed_node : 1; /* allows access to managed node */
        ... etc ...
    };

    ___alloc_frozen_pages_noprof(..., struct alloc_context *ac) {
        ac->flags.wmark_low = 1;
        ...
        prepare_alloc_pages(..., ac);
        ac->flags.nofrag = alloc_flags_nofragment(...);

        /* First allocation attempt; ac is already a pointer here */
        page = get_page_from_freelist(alloc_gfp, order, ac);
        ...
    }

    __alloc_frozen_pages_noprof(...) {
        struct alloc_context ac = {};

        ___alloc_frozen_pages_noprof(..., &ac);
    }

    __alloc_frozen_pages_context_noprof(..., struct alloc_flags *aflags) {
        struct alloc_context ac = {};

        /* Snapshot to prevent external changes */
        if (aflags)
            ac.flags = *aflags;

        sanitize_alloc_flags(&ac.flags); /* BUG() on insanity */
        ___alloc_frozen_pages_noprof(..., &ac);
    }

For existing users, they can continue to use __GFP flags and existing
allocation interfaces.  For special context users, they can use the
context interface.
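
As a rough illustration of the snapshot-and-sanitize step, here is a
small userspace sketch. It is not the proposed kernel code: the toy_*
names are hypothetical, the bitfields use unsigned int for portability,
and the sanitizer returns -EINVAL where the pseudocode above would
BUG(). The point is only that the caller's flags are copied once,
internal-only bits are rejected, and the allocator then sets its own
internal bits on the private copy.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_alloc_flags {
    /* internal only: callers must not set these */
    unsigned int wmark_min    : 1;
    unsigned int wmark_low    : 1;
    unsigned int wmark_high   : 1;
    /* external context flags */
    unsigned int cma          : 1;
    unsigned int unmapped     : 1;
    unsigned int managed_node : 1;
};

struct toy_alloc_context {
    struct toy_alloc_flags flags;
};

/* Returns 0 if only external flags are set, -EINVAL otherwise. */
static int toy_sanitize_alloc_flags(const struct toy_alloc_flags *f)
{
    if (f->wmark_min || f->wmark_low || f->wmark_high)
        return -EINVAL;
    return 0;
}

static int toy_alloc_context_init(struct toy_alloc_context *ac,
                                  const struct toy_alloc_flags *aflags)
{
    int ret;

    assert(ac != NULL);

    /* Snapshot so later changes by the caller cannot affect us. */
    ac->flags = aflags ? *aflags : (struct toy_alloc_flags){ 0 };

    ret = toy_sanitize_alloc_flags(&ac->flags);
    if (ret)
        return ret;

    /* Internal bits are set by the allocator on its own copy only. */
    ac->flags.wmark_low = 1;
    return 0;
}
```

A caller passing { .unmapped = 1 } gets a context with unmapped and
wmark_low set; a caller passing { .wmark_min = 1 } is rejected.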

For __GFP_PRIVATE, this would look like modifying just a handful of
interfaces to include alloc_context or alloc_flags - e.g.:

   folio_alloc_mpol(gfp, order, pol, ilx, nid)
       ->
   folio_alloc_mpol(ac, order, pol, ilx, nid);

And a bit of logic to simply set:

   ac.flags.managed_node = 1;

This kind of pattern already exists with things like scan_control,
oom_control, etc - which carry gfp masks around.  Maybe those things
should just carry the full alloc_context around (w/ gfp and flags).
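
A hedged sketch of what "carry the full alloc_context around" could look
like, again in userspace with hypothetical toy_* names. toy_scan_control
is only loosely modelled on scan_control, and toy_may_use_node() stands
in for whatever check the allocator would do on a managed node.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int toy_gfp_t;
#define TOY_GFP_KERNEL 0x1u /* arbitrary value for the sketch */

struct toy_alloc_flags {
    unsigned int managed_node : 1; /* allows access to managed node */
};

/* The gfp mask and the context flags travel together. */
struct toy_alloc_context {
    toy_gfp_t gfp_mask;
    struct toy_alloc_flags flags;
};

/* A control struct embeds the context instead of a bare gfp mask. */
struct toy_scan_control {
    unsigned long nr_to_reclaim;
    struct toy_alloc_context ac;
};

/* Stand-in for the allocator-side check on a candidate node. */
static bool toy_may_use_node(const struct toy_alloc_context *ac,
                             bool node_is_managed)
{
    assert(ac);
    return !node_is_managed || ac->flags.managed_node;
}
```

Code that already passes sc->gfp_mask down would pass &sc->ac instead,
and setting sc.ac.flags.managed_node = 1 is the only extra step for a
special-context user.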

~Gregory



* Re: [RFC] __GFP_UNMAPPED and __GFP_PRIVATE follow up
  2026-05-14 17:42 [RFC] __GFP_UNMAPPED and __GFP_PRIVATE follow up Gregory Price
@ 2026-05-15  9:43 ` Brendan Jackman
  2026-05-15 15:48   ` Gregory Price
  0 siblings, 1 reply; 4+ messages in thread
From: Brendan Jackman @ 2026-05-15  9:43 UTC (permalink / raw)
  To: Gregory Price, linux-mm, jackmanb
  Cc: kernel-team, vishal.l.verma, ira.weiny, dan.j.williams, longman,
	akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
	mhocko, osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, sj, baolin.wang, npache, ryan.roberts, dev.jain, baohua,
	lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
	linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
	harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng, nphamcs,
	bhe, zhengqi.arch, terry.bowman, owner-linux-mm

On Thu May 14, 2026 at 5:42 PM UTC, Gregory Price wrote:
...
> Maybe we could modify alloc_flags to be a struct, and export that
> without being tied down to a 32/64-bit flag field - and mark certain
> sets of alloc flags verboten (internally controlled / controlled by GFP
> flags, and will either be ignored or cause a BUG()).
>
> Then we could get something like:
>
>     struct alloc_flags {
>         /*
>          * internal only: will be ignored, cleared, or cause BUG() if used,
>          * or should be applied via the appropriate __GFP flag.
>          */
>         uint64_t wmark_min : 1;
>         uint64_t wmark_low : 1;
>         uint64_t wmark_high : 1;
>         ... etc ...
>         /*
>          * external context flags
>          * allows explicit access to certain resources
>          */
>         uint64_t cma          : 1; /* allows access to CMA regions */
>         uint64_t unmapped     : 1; /* return pages in unmapped state */
>         uint64_t managed_node : 1; /* allows access to managed node */
>         ... etc ...
>     };
>
>     ___alloc_frozen_pages_noprof(..., struct alloc_context *ac) {
>         ac->flags.wmark_low = 1;
>         ...
>         prepare_alloc_pages(..., ac);
>         ac->flags.nofrag = alloc_flags_nofragment(...);
>
>         /* First allocation attempt; ac is already a pointer here */
>         page = get_page_from_freelist(alloc_gfp, order, ac);
>         ...
>     }
>
>     __alloc_frozen_pages_noprof(...) {
>         struct alloc_context ac = {};
>
>         ___alloc_frozen_pages_noprof(..., &ac);
>     }
>
>     __alloc_frozen_pages_context_noprof(..., struct alloc_flags *aflags) {
>         struct alloc_context ac = {};
>
>         /* Snapshot to prevent external changes */
>         if (aflags)
>             ac.flags = *aflags;
>
>         sanitize_alloc_flags(&ac.flags); /* BUG() on insanity */
>         ___alloc_frozen_pages_noprof(..., &ac);
>     }

Yeah, I have had a similar thought before. In fact, I wonder if we could
have a pointer in there that effectively allows you to replace
NODE_DATA? I think that would be a more general mechanism to achieve
that `managed_node` thing?

My original motive for that was: if we could get the allocator to stop
[unconditionally] mutating global variables it would make it easier to
test.

My feeling from poking around in the code is that setting this up is
actually quite a big job in page_alloc.c. But, I think it could be done
in a way that leaves the code better instead of worse.

There might be some annoying stuff like "turning these things that are
currently function arguments into struct fields effectively causes a
register spill and this code is hot enough for that to matter"? But that
seems like a bridge to cross if we come to it, not something to
premature-optimise over. (Do register spills matter in 2026 anyway?
I think registers and the stack are kinda virtual?)

(Sorry this is such a vague thumbs up without really contributing
anything but I'm just giving what I've got :D)



* Re: [RFC] __GFP_UNMAPPED and __GFP_PRIVATE follow up
  2026-05-15  9:43 ` Brendan Jackman
@ 2026-05-15 15:48   ` Gregory Price
  2026-05-15 17:09     ` Brendan Jackman
  0 siblings, 1 reply; 4+ messages in thread
From: Gregory Price @ 2026-05-15 15:48 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: linux-mm, kernel-team, vishal.l.verma, ira.weiny, dan.j.williams,
	longman, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, osalvador, ziy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, ying.huang, apopple, axelrasmussen, yuanchu,
	weixugc, yury.norov, linux, mhiramat, mathieu.desnoyers, tj,
	hannes, mkoutny, sj, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
	linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
	harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng, nphamcs,
	bhe, zhengqi.arch, terry.bowman, owner-linux-mm

On Fri, May 15, 2026 at 09:43:02AM +0000, Brendan Jackman wrote:
> 
> Yeah, I have had a similar thought before. In fact, I wonder if we could
> have a pointer in there that effectively allows you to replace
> NODE_DATA? I think that would be a more general mechanism to achieve
> that `managed_node` thing?
>

Well, alloc_context already contains a nodemask.  I could see even
pulling that argument into the struct if we seriously consider exporting
alloc_context.

I'll have to think about the NODE_DATA replacement.  I don't know if
that's really feasible considering that this structure is used statically
all over the kernel for runtime node-data lookups from pages/folios.

> My original motive for that was: if we could get the allocator to stop
> [unconditionally] mutating global variables it would make it easier to
> test.
> 

Can you expand on this a bit more?
What globals are you referring to exactly?

There has been a desire on our side (Meta) to make mm/ more testable in
general (for both performance and correctness) - including page_alloc.c.

But with everything so tightly coupled the best we can presently do is
runtime testing of benchmarks and workloads.

The same issue exists for things like LRU/MGLRU, where you can't really
isolate a change because you get emergent properties.

> My feeling from poking around in the code is that setting this up is
> actually quite a big job in page_alloc.c. But, I think it could be done
> in a way that leaves the code better instead of worse.
>

Yes, and since it is the literal bedrock for all of mm/, getting it wrong
would be catastrophic, so it is both a large job and high risk.

At the very least, using what exists (alloc_context) to extend to a new
interface for new users (unmapped, managed nodes, etc) while leaving what
is there until the new one becomes stable would be a good mitigation.

>
> There might be some annoying stuff like "turning these things that are
> currently function arguments into struct fields effectively causes a
> register spill and this code is hot enough for that to matter"? But that
> seems like a bridge to cross if we come to it, not something to
> premature-optimise over. (Do register spills matter in 2026 anyway?
> I think registers and the stack are kinda virtual?)
>

I'd be more worried that new stack allocations (alloc_context) and
populating them would lead to regressions than about register spills, but
it's not worth thinking about until there's data / it's testable.

(Another argument for making the core of this more testable)

>
> (Sorry this is such a vague thumbs up without really contributing
> anything but I'm just giving what I've got :D)
> 

I requested comments, I got comments :P Mission success.

~Gregory



* Re: [RFC] __GFP_UNMAPPED and __GFP_PRIVATE follow up
  2026-05-15 15:48   ` Gregory Price
@ 2026-05-15 17:09     ` Brendan Jackman
  0 siblings, 0 replies; 4+ messages in thread
From: Brendan Jackman @ 2026-05-15 17:09 UTC (permalink / raw)
  To: Gregory Price, Brendan Jackman
  Cc: linux-mm, kernel-team, vishal.l.verma, ira.weiny, dan.j.williams,
	longman, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, osalvador, ziy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, ying.huang, apopple, axelrasmussen, yuanchu,
	weixugc, yury.norov, linux, mhiramat, mathieu.desnoyers, tj,
	hannes, mkoutny, sj, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, muchun.song, xu.xin16, chengming.zhou, jannh,
	linmiaohe, nao.horiguchi, pfalcato, rientjes, shakeel.butt, riel,
	harry.yoo, cl, roman.gushchin, chrisl, kasong, shikemeng, nphamcs,
	bhe, zhengqi.arch, terry.bowman, owner-linux-mm

On Fri May 15, 2026 at 3:48 PM UTC, Gregory Price wrote:
> On Fri, May 15, 2026 at 09:43:02AM +0000, Brendan Jackman wrote:
>> 
>> Yeah, I have had a similar thought before. In fact, I wonder if we could
>> have a pointer in there that effectively allows you to replace
>> NODE_DATA? I think that would be a more general mechanism to achieve
>> that `managed_node` thing?
>>
>
> Well, alloc_context already contains a nodemask.  I could see even
> pulling that argument into the struct if we seriously consider exporting
> alloc_context.
>
> I'll have to think about the NODE_DATA replacement.  I don't know if
> that's really feasible considering that this structure is used statically
> all over the kernel for runtime node-data lookups from pages/folios.

Oh yeah I wasn't thinking of replacing it completely, more that the
global NODE_DATA is the "default" node data (and the pointer field being
NULL would probably mean to use that), but then private node datas could
also exist.
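
A minimal sketch of that "NULL means default" lookup, in userspace and
with hypothetical toy_* names throughout. toy_global_pgdat[] stands in
for whatever NODE_DATA() resolves to today, and the node_data field is
an assumed addition to the context, not an existing member.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_NODES 4

struct toy_pgdat {
    int node_id;
    unsigned long nr_free;
};

/* Global default table, standing in for the NODE_DATA() backing store. */
static struct toy_pgdat toy_global_pgdat[TOY_MAX_NODES];

struct toy_alloc_context {
    /*
     * NULL: use the global table. Otherwise: an array of
     * TOY_MAX_NODES private entries owned by the caller.
     */
    struct toy_pgdat *node_data;
};

/* Resolve node data through the context, falling back to the global. */
static struct toy_pgdat *toy_node_data(const struct toy_alloc_context *ac,
                                       int nid)
{
    assert(nid >= 0 && nid < TOY_MAX_NODES);
    if (ac && ac->node_data)
        return &ac->node_data[nid];
    return &toy_global_pgdat[nid];
}
```

Existing callers that never set node_data keep resolving to the global
table, while a caller with a private array never touches it.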

>> My original motive for that was: if we could get the allocator to stop
>> [unconditionally] mutating global variables it would make it easier to
>> test.
>> 
>
> Can you expand on this a bit more?
> What globals are you referring to exactly?

IIRC it's just a few, the vast majority of the global mutable state is
ultimately downstream of NODE_DATA, plus there are some tuning knobs.

> There has been a desire on our side (Meta) to make mm/ more testable in
> general (for both performance and correctness) - including page_alloc.c.
>
> But with everything so tightly coupled the best we can presently do is
> runtime testing of benchmarks and workloads.
>
> The same issue exists for things like LRU/MGLRU, where you can't really
> isolate a change because you get emergent properties.

Yeah basically just this; if you try to write a test that constructs
some specific condition to try and hit the nastiest possible fallback
case, you are racing against the real system. Ditto if you wanna check
your code against page_group_by_mobility_disabled=1, well you'd better
figure out a way to actually boot a whole system with that property.

I spent a few days trying to solve this in the way that VMA and xarray
etc have, i.e. by making the page allocator a library and then testing
it from userspace. I think that would work but it needs much more than a
few days to make it happen (admittedly, I had tried to do this with AI
at the time and it failed miserably. Maybe with today's models it would
work, which could massively accelerate the grunt work of e.g. splitting
stuff into new files).

But anyway, if you could carve out a distinct node data etc you could just
write KUnit tests that operate on a completely isolated "instance" of
the allocator, even though it still runs in a real kernel.
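
To show the shape of such a test, here is a toy userspace sketch. It is
not the page allocator and not KUnit: toy_allocator, its two free lists
and the fallback_disabled knob are all hypothetical. The only point is
that once the state and the tuning knob live in an instance, a test can
construct the "preferred list is empty" case directly instead of racing
the real system or booting with a special parameter.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_list { TOY_PREFERRED, TOY_FALLBACK, TOY_NR_LISTS };

/* All mutable state, including the knob, lives in the instance. */
struct toy_allocator {
    unsigned long nr_free[TOY_NR_LISTS];
    bool fallback_disabled;     /* per-instance knob, not a global */
    unsigned long nr_fallbacks; /* how often the fallback path ran */
};

/* Returns the list the page came from, or -1 if nothing is free. */
static int toy_alloc_page(struct toy_allocator *a)
{
    assert(a);

    if (a->nr_free[TOY_PREFERRED]) {
        a->nr_free[TOY_PREFERRED]--;
        return TOY_PREFERRED;
    }

    if (!a->fallback_disabled && a->nr_free[TOY_FALLBACK]) {
        a->nr_free[TOY_FALLBACK]--;
        a->nr_fallbacks++;
        return TOY_FALLBACK;
    }

    return -1;
}
```

A test can start from { .nr_free = { 0, 1 } } to force the fallback path
on the first call, and flip fallback_disabled on a second instance to
check the other configuration in the same run.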


