public inbox for linux-mm@kvack.org
* [PATCH v2 00/22] mm: Add __GFP_UNMAPPED
@ 2026-03-20 18:23 Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 01/22] x86/mm: split out preallocate_sub_pgd() Brendan Jackman
                   ` (21 more replies)
  0 siblings, 22 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

.:: What? Why?

This series adds support for efficiently allocating pages that are not
present in the direct map. This is instrumental to two different
immediate goals:

1. This supports the effort to remove guest_memfd memory from the direct
   map [0]. One of the challenges faced in that effort has been
   efficiently eliminating TLB entries; this series offers a solution to
   that problem.

2. Address Space Isolation (ASI) [1] also needs an efficient way to
   allocate pages that are missing from the direct map. Although for ASI
   the needs are slightly different (in that case, the pages need only
   be removed from ASI's special pagetables), the most interesting mm
   challenges are basically the same.

   So, __GFP_UNMAPPED serves as a Trojan horse to get the page allocator
   into a state where adding ASI's features "Should Be Easy".

   This series _also_ serves as a Trojan horse for the "mermap" (details
   below) which is also a key building block for making ASI efficient.

Longer term, there are a wide range of security techniques unlocked by
being able to efficiently remove pages from the direct map. This ranges
from straightforward nice-to-haves like removing unused aliases of the
vmalloc area, to more ambitious ideas like totally unmapping _all_ user
memory.

There may also be non-security usecases for this feature, for example
at LPC Sumit Garg presented an issue with memory-firewalled client
devices that could be remediated by __GFP_UNMAPPED [2]. It also seems
likely to be a key step towards dropping execmem as a distinct
allocation layer.

.:: Design

The key design elements introduced here are just repurposed from
previous attempts to directly introduce ASI's needs to the page
allocator [3]. The only real difference is that now these support
totally unmapping stuff from the direct map, instead of only unmapping
it from ASI's special pagetables.

Note that it's important that allocating unmapped memory be fully
scalable, in principle supporting basically everything the mm can do -
previous __GFP_UNMAPPED proposals have implemented it as a cache on top
of the page allocator. That won't do here because the end goal is
protecting user data. That means that ultimately, the vast majority of
memory in the system ought to be allocated through a mechanism in the
vein of __GFP_UNMAPPED, so a really full-featured allocator is required.
(In this series, __GFP_UNMAPPED is only supported for unmovable
allocations, but that's just to save space in datastructures: the
implementation is supposed to be fully generic, eventually allowing
reclaim and compaction for the unmapped pages).

Because of the above, I'm proposing a new GFP flag. However the only
part of that I'm really attached to is that it's an interface to the
real page allocator. For now I still assume a GFP flag is the cleanest
way to get that but in principle I'm not opposed to
alloc_unmapped_pages() or whatever. Alternatively, there's some
discussion in [9] about how to get more GFP space back, which I'm happy
to help with.

.:::: Design: Introducing "freetypes"

The biggest challenge for efficiently getting stuff out of the direct
map is TLB flushing. Pushing this problem into the page allocator turns
out to enable amortising that flush cost into almost nothing. The core
idea is to have pools of already-unmapped pages. We'd like those pages
to be physically contiguous so they don't unduly fragment the pagetables
around them, and we'd like to be able to efficiently look up these
already-unmapped pages during allocation. The page allocator already has
deeply-ingrained functionality for physically grouping pages by a
certain attribute and then indexing free pages by that attribute. This
mechanism is: migratetypes.

So basically, this series extends the concepts of migratetypes in the
allocator so that as well as just representing mobility, they can
represent other properties of the page too. (Actually, migratetypes are
already sort of overloaded, but the main extension is to be able to
represent _orthogonal_ properties). In order to avoid further
overloading the concept of a migratetype, this extension is done by
adding a new concept on top of migratetype: the _freetype_. A freetype
is basically just a migratetype plus some flags, and it replaces
migratetypes wherever the latter is currently used to index free
pages.

The first freetype flag is then added, which marks the pages it indexes
as being absent from the direct map. This is then used to implement the
new __GFP_UNMAPPED flag, which allocates pages from pageblocks that have
the new flag, or unmaps pages if no existing ones are already available.

.:::: Design: Introducing the "mermap"

Sharp readers might by now be asking how __GFP_UNMAPPED interacts with
__GFP_ZERO. If pages aren't in the direct map, how can the page
allocator zero them? The solution is the "mermap", short for "epheMERal
mapping". The mermap provides an efficient way to temporarily map pages
into the local address space, and the allocator uses these mappings to
zero pages.

Using the mermap securely requires some knowledge about the usage of the
pages. One slightly awkward part of this design is that the page
allocator's usage of the mermap then "leaks" out so that callers who
allocate with __GFP_UNMAPPED|__GFP_ZERO need to be aware of the mermap's
security implications. For the guest_memfd unmapping usecase, that means
when guest_memfd.c makes these special allocations, it is only safe
because the pages will belong to the current process. In other words,
the use of the mermap potentially allows that process to leak the pages
via CPU sidechannels (unless more holistic/expensive mitigations are
enabled).

Since this cover letter is already too long I won't describe most
details of the mermap here, please see the patch that introduces it.

But one key detail is that it requires a kernel-space but mm-local
virtual address region. So... this series adds that too (for x86). This
is called the mm-local region and is implemented by "just" extending and
generalising the LDT remap area.

.:: Outline of the patchset

- Patches  1 ->  2 introduce the mm-local region for x86

- Patches  3 ->  5 are prep patches for the mermap

- Patches  6 ->  7 introduce the mermap

- Patches  8 -> 11 introduce freetypes

  - Patch 8 in particular is the big annoying switch-over which changes
    a whole bunch of code from "migratetype" to "freetype". In order to
    try and have the compiler help out with catching bugs, this is done
    with an annoying typedef. I'm sorry that this patch is so annoying,
    but I think if we do want to extend the allocator along these lines
    then a typedef + big annoying patch is probably the safest way.

- Patches 12 -> 21 introduce __GFP_UNMAPPED

- Patch 22 adopts __GFP_UNMAPPED for secretmem.

.:: Hacky bits

.:::: Hacky bits: mermap pagetable management

As discussed in [8], I can't find any existing pagetable management code
that serves the needs of the mermap. I have responded to this with some
extensions to __apply_to_page_range(). Those extensions make it serve
the mermap's needs re avoiding allocation and locking. But they don't
offer an efficient way to preallocate empty pagetables in
mermap_mm_prepare(). For now I think the inefficient way is fine, but if a
workload emerges that is sensitive to the latency of mermap setup, it
will need to be optimised. I think that would best be done as part of
the general pagetable library I hope to discuss in [8].

.:::: Hacky bits: simplistic secretmem integration

The secretmem integration leaves the main optimisations on the table;
the security-required flushes of the mermap areas are implemented via
distinct tlb_flush_mm() calls. It should be possible to amortize the
mermap TLB flushes completely into the normal VMA flushing. However, as
far as I know there is no performance-sensitive usecase for secretmem.
So, I've just implemented the minimal adoption. This will at least avoid
fragmentation of the direct map, even if it doesn't reduce TLB flushing.
If anyone knows of a workload that might benefit from dropping that
flushing, let me know!

.:: Performance

In [4] is a branch containing: 

1. This series.

2. All the key kernel patches from the Firecracker team's "secret-free"
   effort, which includes guest_memfd unmapping ([0]).

3. Some prototype patches to switch guest_memfd over from an ad-hoc
   unmapping logic to use of __GFP_UNMAPPED (plus direct use of the
   mermap to implement write()).

I benchmarked this using Firecracker's own performance tests [4], which
measure the time required to populate the VM guest's memory. This
population happens via write() so it exercises the mermap. I ran this on
a Sapphire Rapids machine [5]. The baseline here is just the secret-free
patches on their own. "gfp_unmapped" is the branch described above.
"skip-flush" provides a reference against an implementation that just
skips flushing the TLB when unmapping guest_memfd pages, which serves as
an upper-bound on performance.

metric: populate_latency (ms)   |  test: firecracker-perf-tests-wrapped
+---------------+---------+----------+----------+------------------------+----------+--------+
| nixos_variant | samples |     mean |      min | histogram              |      max | Δμ     |
+---------------+---------+----------+----------+------------------------+----------+--------+
| baseline      |      30 |    1.04s |    1.02s |                     █  |    1.10s |        |
| gfp_unmapped  |      30 | 313.02ms | 299.48ms |       █                | 343.25ms | -70.0% |
| skip-flush    |      30 | 325.80ms | 307.91ms |       █                | 333.30ms | -68.8% |
+---------------+---------+----------+----------+------------------------+----------+--------+

Conclusion: it's close to the best case performance for this particular
workload. (Note that in this sample the gfp_unmapped mean is actually
faster than skip-flush - that's just noise, not a consistent
observation.)

[0] [PATCH v10 00/15] Direct Map Removal Support for guest_memfd
    https://lore.kernel.org/all/20260126164445.11867-1-kalyazin@amazon.com/

[1] https://linuxasi.dev/

[2] https://lpc.events/event/19/contributions/2095/

[3] https://lore.kernel.org/lkml/20250313-asi-page-alloc-v1-0-04972e046cea@google.com/

[4] https://github.com/bjackman/kernel-benchmarks-nix/blob/fd56c93344760927b71161368230a15741a5869f/packages/benchmarks/firecracker-perf-tests/firecracker-perf-tests.sh

[5] https://github.com/bjackman/aethelred/blob/eb0dd0e99ee08fa0534733113e93b89499affe91

[6] https://github.com/bjackman/kernel-benchmarks-nix/blob/924a5cb3215360e4620eafcd7e00eb4e69e2ee93/packages/benchmarks/stress-ng/default.nix#L18

[7] https://lore.kernel.org/lkml/20230308094106.227365-1-rppt@kernel.org/

[8] https://lore.kernel.org/all/20260219175113.618562-1-jackmanb@google.com/

[9] https://lore.kernel.org/lkml/20260319-gfp64-v1-0-2c73b8d42b7f@google.com/

Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: x86@kernel.org
Cc: rppt@kernel.org
Cc: Sumit Garg <sumit.garg@oss.qualcomm.com>
To: Borislav Petkov <bp@alien8.de>
To: Dave Hansen <dave.hansen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>
To: Andrew Morton <akpm@linux-foundation.org>
To: David Hildenbrand <david@kernel.org>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Vlastimil Babka <vbabka@kernel.org>
To: Mike Rapoport <rppt@kernel.org>
To: Wei Xu <weixugc@google.com>
To: Johannes Weiner <hannes@cmpxchg.org>
To: Zi Yan <ziy@nvidia.com>
Cc: yosryahmed@google.com
Cc: derkling@google.com
Cc: reijiw@google.com
Cc: Will Deacon <will@kernel.org>
Cc: rientjes@google.com
Cc: "Kalyazin, Nikita" <kalyazin@amazon.co.uk>
Cc: patrick.roy@linux.dev
Cc: "Itazuri, Takahiro" <itazur@amazon.co.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: David Kaplan <david.kaplan@amd.com>
Cc: Thomas Gleixner <tglx@kernel.org>

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
Changes in v2:
- Adopted it for secretmem
- Fixed mm-local region under KPTI on PAE
- Added __apply_to_page_range() mess
- Added fault injection for mermap_get(), fixed bug this uncovered
- Fixed various silly bugs
- Link to v1 (RFC): https://lore.kernel.org/r/20260225-page_alloc-unmapped-v1-0-e8808a03cd66@google.com

---
Brendan Jackman (22):
      x86/mm: split out preallocate_sub_pgd()
      x86/mm: Generalize LDT remap into "mm-local region"
      x86/tlb: Expose some flush function declarations to modules
      mm: Create flags arg for __apply_to_page_range()
      mm: Add more flags for __apply_to_page_range()
      x86/mm: introduce the mermap
      mm: KUnit tests for the mermap
      mm: introduce for_each_free_list()
      mm/page_alloc: don't overload migratetype in find_suitable_fallback()
      mm: introduce freetype_t
      mm: move migratetype definitions to freetype.h
      mm: add definitions for allocating unmapped pages
      mm: rejig pageblock mask definitions
      mm: encode freetype flags in pageblock flags
      mm/page_alloc: remove ifdefs from pindex helpers
      mm/page_alloc: separate pcplists by freetype flags
      mm/page_alloc: rename ALLOC_NON_BLOCK back to _HARDER
      mm/page_alloc: introduce ALLOC_NOBLOCK
      mm/page_alloc: implement __GFP_UNMAPPED allocations
      mm/page_alloc: implement __GFP_UNMAPPED|__GFP_ZERO allocations
      mm: Minimal KUnit tests for some new page_alloc logic
      mm/secretmem: Use __GFP_UNMAPPED when available

 Documentation/arch/x86/x86_64/mm.rst    |   4 +-
 arch/x86/Kconfig                        |   3 +
 arch/x86/include/asm/mermap.h           |  23 +
 arch/x86/include/asm/mmu_context.h      | 119 ++++-
 arch/x86/include/asm/page.h             |  32 ++
 arch/x86/include/asm/pgalloc.h          |  33 ++
 arch/x86/include/asm/pgtable_32_areas.h |   9 +-
 arch/x86/include/asm/pgtable_64_types.h |  18 +-
 arch/x86/include/asm/pgtable_types.h    |   2 +
 arch/x86/include/asm/tlbflush.h         |  43 +-
 arch/x86/kernel/ldt.c                   | 130 +-----
 arch/x86/mm/init_64.c                   |  44 +-
 arch/x86/mm/pgtable.c                   |  32 +-
 include/linux/freetype.h                | 147 ++++++
 include/linux/gfp.h                     |  25 +-
 include/linux/gfp_types.h               |  26 ++
 include/linux/mermap.h                  |  63 +++
 include/linux/mermap_types.h            |  44 ++
 include/linux/mm.h                      |  13 +
 include/linux/mm_types.h                |   6 +
 include/linux/mmzone.h                  |  84 ++--
 include/linux/pageblock-flags.h         |  16 +-
 include/trace/events/mmflags.h          |   9 +-
 kernel/fork.c                           |   6 +
 kernel/panic.c                          |   2 +
 kernel/power/snapshot.c                 |   8 +-
 mm/Kconfig                              |  43 ++
 mm/Makefile                             |   3 +
 mm/compaction.c                         |  36 +-
 mm/init-mm.c                            |   3 +
 mm/internal.h                           |  70 ++-
 mm/memory.c                             |  76 +--
 mm/mermap.c                             | 334 ++++++++++++++
 mm/mm_init.c                            |  11 +-
 mm/page_alloc.c                         | 788 +++++++++++++++++++++++---------
 mm/page_isolation.c                     |   2 +-
 mm/page_owner.c                         |   7 +-
 mm/page_reporting.c                     |   4 +-
 mm/pgalloc-track.h                      |   6 +
 mm/secretmem.c                          |  87 +++-
 mm/show_mem.c                           |   4 +-
 mm/tests/mermap_kunit.c                 | 250 ++++++++++
 mm/tests/page_alloc_kunit.c             | 250 ++++++++++
 43 files changed, 2341 insertions(+), 574 deletions(-)
---
base-commit: b5d083a3ed1e2798396d5e491432e887da8d4a06
change-id: 20260112-page_alloc-unmapped-944fe5d7b55c

Best regards,
-- 
Brendan Jackman <jackmanb@google.com>



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 01/22] x86/mm: split out preallocate_sub_pgd()
  2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
@ 2026-03-20 18:23 ` Brendan Jackman
  2026-03-20 19:42   ` Dave Hansen
  2026-03-24 15:27   ` Borislav Petkov
  2026-03-20 18:23 ` [PATCH v2 02/22] x86/mm: Generalize LDT remap into "mm-local region" Brendan Jackman
                   ` (20 subsequent siblings)
  21 siblings, 2 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

This code will be needed elsewhere in a following patch. Split out the
trivial code move for easy review.

This changes the logging slightly: instead of panic() directly reporting
the level of the failure, there is now a generic panic message which
will be preceded by a separate warning that reports the level of the
failure. This is a simple way to have this helper suit the needs of its
new user as well as the existing one.

Other than logging, no functional change intended.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 arch/x86/include/asm/pgalloc.h | 33 +++++++++++++++++++++++++++++++
 arch/x86/mm/init_64.c          | 44 +++++++-----------------------------------
 2 files changed, 40 insertions(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index c88691b15f3c6..3541b86c9c6b0 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -2,6 +2,7 @@
 #ifndef _ASM_X86_PGALLOC_H
 #define _ASM_X86_PGALLOC_H
 
+#include <linux/printk.h>
 #include <linux/threads.h>
 #include <linux/mm.h>		/* for struct page */
 #include <linux/pagemap.h>
@@ -128,6 +129,38 @@ static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pud,
 	___pud_free_tlb(tlb, pud);
 }
 
+/* Allocate a pagetable pointed to by the top hardware level. */
+static inline int preallocate_sub_pgd(struct mm_struct *mm, unsigned long addr)
+{
+	const char *lvl;
+	p4d_t *p4d;
+	pud_t *pud;
+
+	lvl = "p4d";
+	p4d = p4d_alloc(mm, pgd_offset_pgd(mm->pgd, addr), addr);
+	if (!p4d)
+		goto failed;
+
+	if (pgtable_l5_enabled())
+		return 0;
+
+	/*
+	 * On 4-level systems, the P4D layer is folded away and
+	 * the above code does no preallocation.  Below, go down
+	 * to the pud _software_ level to ensure the second
+	 * hardware level is allocated on 4-level systems too.
+	 */
+	lvl = "pud";
+	pud = pud_alloc(mm, p4d, addr);
+	if (!pud)
+		goto failed;
+	return 0;
+
+failed:
+	pr_warn_ratelimited("Failed to preallocate %s\n", lvl);
+	return -ENOMEM;
+}
+
 #if CONFIG_PGTABLE_LEVELS > 4
 static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, p4d_t *p4d)
 {
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index df2261fa4f985..79806386dc42f 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1318,46 +1318,16 @@ static void __init register_page_bootmem_info(void)
 static void __init preallocate_vmalloc_pages(void)
 {
 	unsigned long addr;
-	const char *lvl;
 
 	for (addr = VMALLOC_START; addr <= VMEMORY_END; addr = ALIGN(addr + 1, PGDIR_SIZE)) {
-		pgd_t *pgd = pgd_offset_k(addr);
-		p4d_t *p4d;
-		pud_t *pud;
-
-		lvl = "p4d";
-		p4d = p4d_alloc(&init_mm, pgd, addr);
-		if (!p4d)
-			goto failed;
-
-		if (pgtable_l5_enabled())
-			continue;
-
-		/*
-		 * The goal here is to allocate all possibly required
-		 * hardware page tables pointed to by the top hardware
-		 * level.
-		 *
-		 * On 4-level systems, the P4D layer is folded away and
-		 * the above code does no preallocation.  Below, go down
-		 * to the pud _software_ level to ensure the second
-		 * hardware level is allocated on 4-level systems too.
-		 */
-		lvl = "pud";
-		pud = pud_alloc(&init_mm, p4d, addr);
-		if (!pud)
-			goto failed;
+		if (preallocate_sub_pgd(&init_mm, addr)) {
+			/*
+			 * The pages have to be there now or they will be
+			 * missing in process page-tables later.
+			 */
+			panic("Failed to pre-allocate pagetables for vmalloc area\n");
+		}
 	}
-
-	return;
-
-failed:
-
-	/*
-	 * The pages have to be there now or they will be missing in
-	 * process page-tables later.
-	 */
-	panic("Failed to pre-allocate %s pages for vmalloc area\n", lvl);
 }
 
 void __init arch_mm_preinit(void)

-- 
2.51.2




* [PATCH v2 02/22] x86/mm: Generalize LDT remap into "mm-local region"
  2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 01/22] x86/mm: split out preallocate_sub_pgd() Brendan Jackman
@ 2026-03-20 18:23 ` Brendan Jackman
  2026-03-20 19:47   ` Dave Hansen
  2026-03-25 14:23   ` Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 03/22] x86/tlb: Expose some flush function declarations to modules Brendan Jackman
                   ` (19 subsequent siblings)
  21 siblings, 2 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

Various security features benefit from having process-local address
mappings. Examples include no-direct-map guest_memfd [2] and significant
optimizations for ASI [1].

As pointed out by Andy in [0], x86 already has a PGD entry that is local
to the mm, which is used for the LDT.

So, simply redefine that entry's region as "the mm-local region" and
then redefine the LDT region as a sub-region of that.

With the currently-envisaged usecases, there will be many situations
where almost no processes have any need for the mm-local region.
Therefore, avoid its overhead (memory cost of pagetables, alloc/free
overhead during fork/exit) for processes that don't use it by requiring
its users to explicitly initialize it via the new mm_local_* API.

This means that the LDT remap code can be simplified:

1. map_ldt_struct_to_user() and free_ldt_pgtables() are no longer
   required as the mm_local core code handles that automatically.

2. The sanity-check logic is unified: in both cases just walk the
   pagetables via a generic mechanism. This slightly relaxes the
   sanity-checking since lookup_address_in_pgd() is more flexible than
   pgd_to_pmd_walk(), but this seems to be worth it for the simplified
   code.

On 64-bit, the mm-local region gets a whole PGD. On 32-bit, it gets
just one PMD, i.e. it is completely consumed by the LDT remap - no
investigation has been done into whether it's feasible to expand the
region on 32-bit. Most likely there is no strong usecase for that
anyway.

In both cases, in order to combine the need for on-demand mm-local
initialisation with the desire to transparently handle
propagating mappings to userspace under KPTI, the user and kernel
pagetables are shared at the highest level possible. For PAE that means
the PTE table is shared and for 64-bit the P4D/PUD. This is implemented
by pre-allocating the first shared table when the mm-local region is
first initialised.

The PAE implementation of mm_local_map_to_user() does not allocate
pagetables, it assumes the PMD has been preallocated. To make that
assumption safer, expose PREALLOCATED_PMDs in the arch headers so that
mm_local_map_to_user() can have a BUILD_BUG_ON().

[0] https://lore.kernel.org/linux-mm/CALCETrXHbS9VXfZ80kOjiTrreM2EbapYeGp68mvJPbosUtorYA@mail.gmail.com/
[1] https://linuxasi.dev/
[2] https://lore.kernel.org/all/20250924151101.2225820-1-patrick.roy@campus.lmu.de
Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 Documentation/arch/x86/x86_64/mm.rst    |   4 +-
 arch/x86/Kconfig                        |   2 +
 arch/x86/include/asm/mmu_context.h      | 119 ++++++++++++++++++++++++++++-
 arch/x86/include/asm/page.h             |  32 ++++++++
 arch/x86/include/asm/pgtable_32_areas.h |   9 ++-
 arch/x86/include/asm/pgtable_64_types.h |  12 ++-
 arch/x86/kernel/ldt.c                   | 130 +++++---------------------------
 arch/x86/mm/pgtable.c                   |  32 +-------
 include/linux/mm.h                      |  13 ++++
 include/linux/mm_types.h                |   2 +
 kernel/fork.c                           |   1 +
 mm/Kconfig                              |  11 +++
 12 files changed, 217 insertions(+), 150 deletions(-)

diff --git a/Documentation/arch/x86/x86_64/mm.rst b/Documentation/arch/x86/x86_64/mm.rst
index a6cf05d51bd8c..fa2bb7bab6a42 100644
--- a/Documentation/arch/x86/x86_64/mm.rst
+++ b/Documentation/arch/x86/x86_64/mm.rst
@@ -53,7 +53,7 @@ Complete virtual memory map with 4-level page tables
   ____________________________________________________________|___________________________________________________________
                     |            |                  |         |
    ffff800000000000 | -128    TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
-   ffff880000000000 | -120    TB | ffff887fffffffff |  0.5 TB | LDT remap for PTI
+   ffff880000000000 | -120    TB | ffff887fffffffff |  0.5 TB | MM-local kernel data. Includes LDT remap for PTI
    ffff888000000000 | -119.5  TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
    ffffc88000000000 |  -55.5  TB | ffffc8ffffffffff |  0.5 TB | ... unused hole
    ffffc90000000000 |  -55    TB | ffffe8ffffffffff |   32 TB | vmalloc/ioremap space (vmalloc_base)
@@ -123,7 +123,7 @@ Complete virtual memory map with 5-level page tables
   ____________________________________________________________|___________________________________________________________
                     |            |                  |         |
    ff00000000000000 |  -64    PB | ff0fffffffffffff |    4 PB | ... guard hole, also reserved for hypervisor
-   ff10000000000000 |  -60    PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI
+   ff10000000000000 |  -60    PB | ff10ffffffffffff | 0.25 PB | MM-local kernel data. Includes LDT remap for PTI
    ff11000000000000 |  -59.75 PB | ff90ffffffffffff |   32 PB | direct mapping of all physical memory (page_offset_base)
    ff91000000000000 |  -27.75 PB | ff9fffffffffffff | 3.75 PB | ... unused hole
    ffa0000000000000 |  -24    PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8038b26ae99e0..d7073b6077c62 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -133,6 +133,7 @@ config X86
 	select ARCH_SUPPORTS_RT
 	select ARCH_SUPPORTS_AUTOFDO_CLANG
 	select ARCH_SUPPORTS_PROPELLER_CLANG    if X86_64
+	select ARCH_SUPPORTS_MM_LOCAL_REGION	if X86_64 || X86_PAE
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_CMPXCHG_LOCKREF		if X86_CX8
 	select ARCH_USE_MEMTEST
@@ -2323,6 +2324,7 @@ config CMDLINE_OVERRIDE
 config MODIFY_LDT_SYSCALL
 	bool "Enable the LDT (local descriptor table)" if EXPERT
 	default y
+	select MM_LOCAL_REGION if MITIGATION_PAGE_TABLE_ISOLATION || X86_PAE
 	help
 	  Linux can allow user programs to install a per-process x86
 	  Local Descriptor Table (LDT) using the modify_ldt(2) system
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index ef5b507de34e2..14f75d1d7e28f 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -8,8 +8,10 @@
 
 #include <trace/events/tlb.h>
 
+#include <asm/tlb.h>
 #include <asm/tlbflush.h>
 #include <asm/paravirt.h>
+#include <asm/pgalloc.h>
 #include <asm/debugreg.h>
 #include <asm/gsseg.h>
 #include <asm/desc.h>
@@ -59,7 +61,6 @@ static inline void init_new_context_ldt(struct mm_struct *mm)
 }
 int ldt_dup_context(struct mm_struct *oldmm, struct mm_struct *mm);
 void destroy_context_ldt(struct mm_struct *mm);
-void ldt_arch_exit_mmap(struct mm_struct *mm);
 #else	/* CONFIG_MODIFY_LDT_SYSCALL */
 static inline void init_new_context_ldt(struct mm_struct *mm) { }
 static inline int ldt_dup_context(struct mm_struct *oldmm,
@@ -68,7 +69,6 @@ static inline int ldt_dup_context(struct mm_struct *oldmm,
 	return 0;
 }
 static inline void destroy_context_ldt(struct mm_struct *mm) { }
-static inline void ldt_arch_exit_mmap(struct mm_struct *mm) { }
 #endif
 
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
@@ -223,10 +223,123 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
 	return ldt_dup_context(oldmm, mm);
 }
 
+#ifdef CONFIG_MM_LOCAL_REGION
+static inline void mm_local_region_free(struct mm_struct *mm)
+{
+	if (mm_local_region_used(mm)) {
+		struct mmu_gather tlb;
+		unsigned long start = MM_LOCAL_BASE_ADDR;
+		unsigned long end = MM_LOCAL_END_ADDR;
+
+		/*
+		 * Although free_pgd_range() is intended for freeing user
+		 * page-tables, it also works out for kernel mappings on x86.
+		 * We use tlb_gather_mmu_fullmm() to avoid confusing the
+		 * range-tracking logic in __tlb_adjust_range().
+		 */
+		tlb_gather_mmu_fullmm(&tlb, mm);
+		free_pgd_range(&tlb, start, end, start, end);
+		tlb_finish_mmu(&tlb);
+
+		mm_flags_clear(MMF_LOCAL_REGION_USED, mm);
+	}
+}
+
+#if defined(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION) && defined(CONFIG_X86_PAE)
+static inline pmd_t *pgd_to_pmd_walk(pgd_t *pgd, unsigned long va)
+{
+	p4d_t *p4d;
+	pud_t *pud;
+
+	if (pgd->pgd == 0)
+		return NULL;
+
+	p4d = p4d_offset(pgd, va);
+	if (p4d_none(*p4d))
+		return NULL;
+
+	pud = pud_offset(p4d, va);
+	if (pud_none(*pud))
+		return NULL;
+
+	return pmd_offset(pud, va);
+}
+
+static inline int mm_local_map_to_user(struct mm_struct *mm)
+{
+	BUILD_BUG_ON(!PREALLOCATED_PMDS);
+	pgd_t *k_pgd = pgd_offset(mm, MM_LOCAL_BASE_ADDR);
+	pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
+	pmd_t *k_pmd, *u_pmd;
+	int err;
+
+	k_pmd = pgd_to_pmd_walk(k_pgd, MM_LOCAL_BASE_ADDR);
+	u_pmd = pgd_to_pmd_walk(u_pgd, MM_LOCAL_BASE_ADDR);
+
+	BUILD_BUG_ON(MM_LOCAL_END_ADDR - MM_LOCAL_BASE_ADDR > PMD_SIZE);
+
+	/* Preallocate the PTE table so it can be shared. */
+	err = pte_alloc(mm, k_pmd);
+	if (err)
+		return err;
+
+	/* Point the userspace PMD at the same PTE as the kernel PMD. */
+	set_pmd(u_pmd, *k_pmd);
+	return 0;
+}
+#elif defined(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION)
+static inline int mm_local_map_to_user(struct mm_struct *mm)
+{
+	pgd_t *pgd;
+	int err;
+
+	err = preallocate_sub_pgd(mm, MM_LOCAL_BASE_ADDR);
+	if (err)
+		return err;
+
+	pgd = pgd_offset(mm, MM_LOCAL_BASE_ADDR);
+	set_pgd(kernel_to_user_pgdp(pgd), *pgd);
+	return 0;
+}
+#else
+static inline int mm_local_map_to_user(struct mm_struct *mm)
+{
+	WARN_ONCE(1, "mm_local_map_to_user() not implemented");
+	return -EINVAL;
+}
+#endif
+
+/*
+ * Do initial setup of the mm-local region. Call from process context.
+ *
+ * Under PTI, userspace shares the pagetables for the mm-local region with the
+ * kernel (if you map stuff here, it's immediately mapped into userspace too),
+ * as is needed for the LDT remap. It's assumed that nothing mapped here needs
+ * to be protected from Meltdown-type attacks from the current process.
+ */
+static inline int mm_local_region_init(struct mm_struct *mm)
+{
+	int err;
+
+	if (boot_cpu_has(X86_FEATURE_PTI)) {
+		err = mm_local_map_to_user(mm);
+		if (err)
+			return err;
+	}
+
+	mm_flags_set(MMF_LOCAL_REGION_USED, mm);
+
+	return 0;
+}
+
+#else
+static inline void mm_local_region_free(struct mm_struct *mm) { }
+#endif /* CONFIG_MM_LOCAL_REGION */
+
 static inline void arch_exit_mmap(struct mm_struct *mm)
 {
 	paravirt_arch_exit_mmap(mm);
-	ldt_arch_exit_mmap(mm);
+	mm_local_region_free(mm);
 }
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 416dc88e35c15..4de4715c3b40f 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -78,6 +78,38 @@ static __always_inline u64 __is_canonical_address(u64 vaddr, u8 vaddr_bits)
 	return __canonical_address(vaddr, vaddr_bits) == vaddr;
 }
 
+#ifdef CONFIG_X86_PAE
+
+/*
+ * In PAE mode, we need to do a cr3 reload (=tlb flush) when
+ * updating the top-level pagetable entries to guarantee the
+ * processor notices the update.  Since this is expensive, and
+ * all 4 top-level entries are used almost immediately in a
+ * new process's life, we just pre-populate them here.
+ */
+#define PREALLOCATED_PMDS	PTRS_PER_PGD
+/*
+ * "USER_PMDS" are the PMDs for the user copy of the page tables when
+ * PTI is enabled. They do not exist when PTI is disabled.  Note that
+ * this is distinct from the user _portion_ of the kernel page tables
+ * which always exists.
+ *
+ * We allocate separate PMDs for the kernel part of the user page-table
+ * when PTI is enabled. We need them to map the per-process LDT into the
+ * user-space page-table.
+ */
+#define PREALLOCATED_USER_PMDS (boot_cpu_has(X86_FEATURE_PTI) ? KERNEL_PGD_PTRS : 0)
+#define MAX_PREALLOCATED_USER_PMDS KERNEL_PGD_PTRS
+
+#else  /* !CONFIG_X86_PAE */
+
+/* No need to prepopulate any pagetable entries in non-PAE modes. */
+#define PREALLOCATED_PMDS	0
+#define PREALLOCATED_USER_PMDS	0
+#define MAX_PREALLOCATED_USER_PMDS 0
+
+#endif	/* CONFIG_X86_PAE */
+
 #endif	/* __ASSEMBLER__ */
 
 #include <asm-generic/memory_model.h>
diff --git a/arch/x86/include/asm/pgtable_32_areas.h b/arch/x86/include/asm/pgtable_32_areas.h
index 921148b429676..7fccb887f8b33 100644
--- a/arch/x86/include/asm/pgtable_32_areas.h
+++ b/arch/x86/include/asm/pgtable_32_areas.h
@@ -30,9 +30,14 @@ extern bool __vmalloc_start_set; /* set once high_memory is set */
 #define CPU_ENTRY_AREA_BASE	\
 	((FIXADDR_TOT_START - PAGE_SIZE*(CPU_ENTRY_AREA_PAGES+1)) & PMD_MASK)
 
-#define LDT_BASE_ADDR		\
-	((CPU_ENTRY_AREA_BASE - PAGE_SIZE) & PMD_MASK)
+/*
+ * On 32-bit the mm-local region is currently completely consumed by the LDT
+ * remap.
+ */
+#define MM_LOCAL_BASE_ADDR	((CPU_ENTRY_AREA_BASE - PAGE_SIZE) & PMD_MASK)
+#define MM_LOCAL_END_ADDR	(MM_LOCAL_BASE_ADDR + PMD_SIZE)
 
+#define LDT_BASE_ADDR		MM_LOCAL_BASE_ADDR
 #define LDT_END_ADDR		(LDT_BASE_ADDR + PMD_SIZE)
 
 #define PKMAP_BASE		\
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 7eb61ef6a185f..1181565966405 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -5,8 +5,11 @@
 #include <asm/sparsemem.h>
 
 #ifndef __ASSEMBLER__
+#include <linux/build_bug.h>
 #include <linux/types.h>
 #include <asm/kaslr.h>
+#include <asm/page_types.h>
+#include <uapi/asm/ldt.h>
 
 /*
  * These are used to make use of C type-checking..
@@ -100,9 +103,12 @@ extern unsigned int ptrs_per_p4d;
 #define GUARD_HOLE_BASE_ADDR	(GUARD_HOLE_PGD_ENTRY << PGDIR_SHIFT)
 #define GUARD_HOLE_END_ADDR	(GUARD_HOLE_BASE_ADDR + GUARD_HOLE_SIZE)
 
-#define LDT_PGD_ENTRY		-240UL
-#define LDT_BASE_ADDR		(LDT_PGD_ENTRY << PGDIR_SHIFT)
-#define LDT_END_ADDR		(LDT_BASE_ADDR + PGDIR_SIZE)
+#define MM_LOCAL_PGD_ENTRY	-240UL
+#define MM_LOCAL_BASE_ADDR	(MM_LOCAL_PGD_ENTRY << PGDIR_SHIFT)
+#define MM_LOCAL_END_ADDR	((MM_LOCAL_PGD_ENTRY + 1) << PGDIR_SHIFT)
+
+#define LDT_BASE_ADDR		MM_LOCAL_BASE_ADDR
+#define LDT_END_ADDR		(LDT_BASE_ADDR + PMD_SIZE)
 
 #define __VMALLOC_BASE_L4	0xffffc90000000000UL
 #define __VMALLOC_BASE_L5 	0xffa0000000000000UL
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 40c5bf97dd5cc..fb2a1914539f8 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -31,6 +31,8 @@
 
 #include <xen/xen.h>
 
+/* LDTs are double-buffered; the buffers are called slots. */
+#define LDT_NUM_SLOTS		2
 /* This is a multiple of PAGE_SIZE. */
 #define LDT_SLOT_STRIDE (LDT_ENTRIES * LDT_ENTRY_SIZE)
 
@@ -186,100 +188,36 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 
 #ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
 
-static void do_sanity_check(struct mm_struct *mm,
-			    bool had_kernel_mapping,
-			    bool had_user_mapping)
+static void sanity_check_ldt_mapping(struct mm_struct *mm)
 {
+	pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
+	pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
+	unsigned int k_level, u_level;
+	bool had_kernel, had_user;
+
+	had_kernel = lookup_address_in_pgd(k_pgd, LDT_BASE_ADDR, &k_level);
+	had_user   = lookup_address_in_pgd(u_pgd, LDT_BASE_ADDR, &u_level);
+
 	if (mm->context.ldt) {
 		/*
 		 * We already had an LDT.  The top-level entry should already
 		 * have been allocated and synchronized with the usermode
 		 * tables.
 		 */
-		WARN_ON(!had_kernel_mapping);
+		WARN_ON(!had_kernel);
 		if (boot_cpu_has(X86_FEATURE_PTI))
-			WARN_ON(!had_user_mapping);
+			WARN_ON(!had_user);
 	} else {
 		/*
 		 * This is the first time we're mapping an LDT for this process.
 		 * Sync the pgd to the usermode tables.
 		 */
-		WARN_ON(had_kernel_mapping);
+		WARN_ON(had_kernel);
 		if (boot_cpu_has(X86_FEATURE_PTI))
-			WARN_ON(had_user_mapping);
+			WARN_ON(had_user);
 	}
 }
 
-#ifdef CONFIG_X86_PAE
-
-static pmd_t *pgd_to_pmd_walk(pgd_t *pgd, unsigned long va)
-{
-	p4d_t *p4d;
-	pud_t *pud;
-
-	if (pgd->pgd == 0)
-		return NULL;
-
-	p4d = p4d_offset(pgd, va);
-	if (p4d_none(*p4d))
-		return NULL;
-
-	pud = pud_offset(p4d, va);
-	if (pud_none(*pud))
-		return NULL;
-
-	return pmd_offset(pud, va);
-}
-
-static void map_ldt_struct_to_user(struct mm_struct *mm)
-{
-	pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
-	pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
-	pmd_t *k_pmd, *u_pmd;
-
-	k_pmd = pgd_to_pmd_walk(k_pgd, LDT_BASE_ADDR);
-	u_pmd = pgd_to_pmd_walk(u_pgd, LDT_BASE_ADDR);
-
-	if (boot_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
-		set_pmd(u_pmd, *k_pmd);
-}
-
-static void sanity_check_ldt_mapping(struct mm_struct *mm)
-{
-	pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
-	pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
-	bool had_kernel, had_user;
-	pmd_t *k_pmd, *u_pmd;
-
-	k_pmd      = pgd_to_pmd_walk(k_pgd, LDT_BASE_ADDR);
-	u_pmd      = pgd_to_pmd_walk(u_pgd, LDT_BASE_ADDR);
-	had_kernel = (k_pmd->pmd != 0);
-	had_user   = (u_pmd->pmd != 0);
-
-	do_sanity_check(mm, had_kernel, had_user);
-}
-
-#else /* !CONFIG_X86_PAE */
-
-static void map_ldt_struct_to_user(struct mm_struct *mm)
-{
-	pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
-
-	if (boot_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
-		set_pgd(kernel_to_user_pgdp(pgd), *pgd);
-}
-
-static void sanity_check_ldt_mapping(struct mm_struct *mm)
-{
-	pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
-	bool had_kernel = (pgd->pgd != 0);
-	bool had_user   = (kernel_to_user_pgdp(pgd)->pgd != 0);
-
-	do_sanity_check(mm, had_kernel, had_user);
-}
-
-#endif /* CONFIG_X86_PAE */
-
 /*
  * If PTI is enabled, this maps the LDT into the kernelmode and
  * usermode tables for the given mm.
@@ -295,6 +233,8 @@ map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
 	if (!boot_cpu_has(X86_FEATURE_PTI))
 		return 0;
 
+	mm_local_region_init(mm);
+
 	/*
 	 * Any given ldt_struct should have map_ldt_struct() called at most
 	 * once.
@@ -339,9 +279,6 @@ map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
 		pte_unmap_unlock(ptep, ptl);
 	}
 
-	/* Propagate LDT mapping to the user page-table */
-	map_ldt_struct_to_user(mm);
-
 	ldt->slot = slot;
 	return 0;
 }
@@ -390,28 +327,6 @@ static void unmap_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt)
 }
 #endif /* CONFIG_MITIGATION_PAGE_TABLE_ISOLATION */
 
-static void free_ldt_pgtables(struct mm_struct *mm)
-{
-#ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
-	struct mmu_gather tlb;
-	unsigned long start = LDT_BASE_ADDR;
-	unsigned long end = LDT_END_ADDR;
-
-	if (!boot_cpu_has(X86_FEATURE_PTI))
-		return;
-
-	/*
-	 * Although free_pgd_range() is intended for freeing user
-	 * page-tables, it also works out for kernel mappings on x86.
-	 * We use tlb_gather_mmu_fullmm() to avoid confusing the
-	 * range-tracking logic in __tlb_adjust_range().
-	 */
-	tlb_gather_mmu_fullmm(&tlb, mm);
-	free_pgd_range(&tlb, start, end, start, end);
-	tlb_finish_mmu(&tlb);
-#endif
-}
-
 /* After calling this, the LDT is immutable. */
 static void finalize_ldt_struct(struct ldt_struct *ldt)
 {
@@ -472,7 +387,6 @@ int ldt_dup_context(struct mm_struct *old_mm, struct mm_struct *mm)
 
 	retval = map_ldt_struct(mm, new_ldt, 0);
 	if (retval) {
-		free_ldt_pgtables(mm);
 		free_ldt_struct(new_ldt);
 		goto out_unlock;
 	}
@@ -494,11 +408,6 @@ void destroy_context_ldt(struct mm_struct *mm)
 	mm->context.ldt = NULL;
 }
 
-void ldt_arch_exit_mmap(struct mm_struct *mm)
-{
-	free_ldt_pgtables(mm);
-}
-
 static int read_ldt(void __user *ptr, unsigned long bytecount)
 {
 	struct mm_struct *mm = current->mm;
@@ -645,10 +554,9 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
 		/*
 		 * This only can fail for the first LDT setup. If an LDT is
 		 * already installed then the PTE page is already
-		 * populated. Mop up a half populated page table.
+		 * populated.
 		 */
-		if (!WARN_ON_ONCE(old_ldt))
-			free_ldt_pgtables(mm);
+		WARN_ON_ONCE(!old_ldt);
 		free_ldt_struct(new_ldt);
 		goto out_unlock;
 	}
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 2e5ecfdce73c3..e4132696c9ef2 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -111,29 +111,6 @@ static void pgd_dtor(pgd_t *pgd)
  */
 
 #ifdef CONFIG_X86_PAE
-/*
- * In PAE mode, we need to do a cr3 reload (=tlb flush) when
- * updating the top-level pagetable entries to guarantee the
- * processor notices the update.  Since this is expensive, and
- * all 4 top-level entries are used almost immediately in a
- * new process's life, we just pre-populate them here.
- */
-#define PREALLOCATED_PMDS	PTRS_PER_PGD
-
-/*
- * "USER_PMDS" are the PMDs for the user copy of the page tables when
- * PTI is enabled. They do not exist when PTI is disabled.  Note that
- * this is distinct from the user _portion_ of the kernel page tables
- * which always exists.
- *
- * We allocate separate PMDs for the kernel part of the user page-table
- * when PTI is enabled. We need them to map the per-process LDT into the
- * user-space page-table.
- */
-#define PREALLOCATED_USER_PMDS	 (boot_cpu_has(X86_FEATURE_PTI) ? \
-					KERNEL_PGD_PTRS : 0)
-#define MAX_PREALLOCATED_USER_PMDS KERNEL_PGD_PTRS
-
 void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmd)
 {
 	paravirt_alloc_pmd(mm, __pa(pmd) >> PAGE_SHIFT);
@@ -150,12 +127,6 @@ void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmd)
 	 */
 	flush_tlb_mm(mm);
 }
-#else  /* !CONFIG_X86_PAE */
-
-/* No need to prepopulate any pagetable entries in non-PAE modes. */
-#define PREALLOCATED_PMDS	0
-#define PREALLOCATED_USER_PMDS	 0
-#define MAX_PREALLOCATED_USER_PMDS 0
 #endif	/* CONFIG_X86_PAE */
 
 static void free_pmds(struct mm_struct *mm, pmd_t *pmds[], int count)
@@ -375,6 +346,9 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 
 void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
+	/* Should be cleaned up in mmap exit path. */
+	VM_WARN_ON_ONCE(mm_local_region_used(mm));
+
 	pgd_mop_up_pmds(mm, pgd);
 	pgd_dtor(pgd);
 	paravirt_pgd_free(mm, pgd);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 70747b53c7da9..413dc707cff9b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -906,6 +906,19 @@ static inline void mm_flags_clear_all(struct mm_struct *mm)
 	bitmap_zero(ACCESS_PRIVATE(&mm->flags, __mm_flags), NUM_MM_FLAG_BITS);
 }
 
+#ifdef CONFIG_MM_LOCAL_REGION
+static inline bool mm_local_region_used(struct mm_struct *mm)
+{
+	return mm_flags_test(MMF_LOCAL_REGION_USED, mm);
+}
+#else
+static inline bool mm_local_region_used(struct mm_struct *mm)
+{
+	VM_WARN_ON_ONCE(mm_flags_test(MMF_LOCAL_REGION_USED, mm));
+	return false;
+}
+#endif
+
 extern const struct vm_operations_struct vma_dummy_vm_ops;
 
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index cee934c6e78ec..0ca7cb7da918f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1944,6 +1944,8 @@ enum {
 
 #define MMF_USER_HWCAP		32	/* user-defined HWCAPs */
 
+#define MMF_LOCAL_REGION_USED	33
+
 #define MMF_INIT_LEGACY_MASK	(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
 				 MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
 				 MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
diff --git a/kernel/fork.c b/kernel/fork.c
index 68cf0109dde3c..ff075c74333fe 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1153,6 +1153,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 fail_nocontext:
 	mm_free_id(mm);
 fail_noid:
+	WARN_ON_ONCE(mm_local_region_used(mm));
 	mm_free_pgd(mm);
 fail_nopgd:
 	futex_hash_free(mm);
diff --git a/mm/Kconfig b/mm/Kconfig
index ebd8ea353687e..2813059df9c1c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1319,6 +1319,10 @@ config SECRETMEM
 	default y
 	bool "Enable memfd_secret() system call" if EXPERT
 	depends on ARCH_HAS_SET_DIRECT_MAP
+	# Soft dependency, for optimisation.
+	imply MM_LOCAL_REGION
+	imply MERMAP
+	imply PAGE_ALLOC_UNMAPPED
 	help
 	  Enable the memfd_secret() system call with the ability to create
 	  memory areas visible only in the context of the owning process and
@@ -1471,6 +1475,13 @@ config LAZY_MMU_MODE_KUNIT_TEST
 
 	  If unsure, say N.
 
+config ARCH_SUPPORTS_MM_LOCAL_REGION
+	def_bool n
+
+config MM_LOCAL_REGION
+	bool
+	depends on ARCH_SUPPORTS_MM_LOCAL_REGION
+
 source "mm/damon/Kconfig"
 
 endmenu

-- 
2.51.2



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 03/22] x86/tlb: Expose some flush function declarations to modules
  2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 01/22] x86/mm: split out preallocate_sub_pgd() Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 02/22] x86/mm: Generalize LDT remap into "mm-local region" Brendan Jackman
@ 2026-03-20 18:23 ` Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 04/22] mm: Create flags arg for __apply_to_page_range() Brendan Jackman
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

In commit bfe3d8f6313d ("x86/tlb: Restrict access to tlbstate") some
low-level logic (the important detail here is flush_tlb_info) was hidden
from modules, along with functions associated with that data.

Later, the set of functions defined here changed and there are now a
bunch of flush_tlb_*() functions that do not depend on x86 internals
like flush_tlb_info.

This leads to some build fragility: KVM (which can be a module) cares
about TLB flushing and includes {linux->asm}/mmu_context.h which
includes asm/tlb.h and asm/tlbflush.h. This x86 TLB code expects these
helpers to be defined (e.g. tlb_flush() calls flush_tlb_mm_range()).

Modules probably shouldn't call these helpers - luckily this is already
enforced by the lack of EXPORT_SYMBOL(). Therefore keep things simple
and just expose the declarations anyway to prevent build failures.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 arch/x86/include/asm/tlbflush.h | 43 +++++++++++++++++++++--------------------
 1 file changed, 22 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 0545fe75c3fa1..56cff03b65aa2 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -251,7 +251,6 @@ struct flush_tlb_info {
 	u8			trim_cpumask;
 };
 
-void flush_tlb_local(void);
 void flush_tlb_one_user(unsigned long addr);
 void flush_tlb_one_kernel(unsigned long addr);
 void flush_tlb_multi(const struct cpumask *cpumask,
@@ -325,26 +324,6 @@ static inline void mm_clear_asid_transition(struct mm_struct *mm) { }
 static inline bool mm_in_asid_transition(struct mm_struct *mm) { return false; }
 #endif /* CONFIG_BROADCAST_TLB_FLUSH */
 
-#define flush_tlb_mm(mm)						\
-		flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL, true)
-
-#define flush_tlb_range(vma, start, end)				\
-	flush_tlb_mm_range((vma)->vm_mm, start, end,			\
-			   ((vma)->vm_flags & VM_HUGETLB)		\
-				? huge_page_shift(hstate_vma(vma))	\
-				: PAGE_SHIFT, true)
-
-extern void flush_tlb_all(void);
-extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
-				unsigned long end, unsigned int stride_shift,
-				bool freed_tables);
-extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
-
-static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
-{
-	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
-}
-
 static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
 {
 	bool should_defer = false;
@@ -513,4 +492,26 @@ static inline void __native_tlb_flush_global(unsigned long cr4)
 	native_write_cr4(cr4 ^ X86_CR4_PGE);
 	native_write_cr4(cr4);
 }
+
+#define flush_tlb_mm(mm)						\
+		flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL, true)
+
+#define flush_tlb_range(vma, start, end)				\
+	flush_tlb_mm_range((vma)->vm_mm, start, end,			\
+			   ((vma)->vm_flags & VM_HUGETLB)		\
+				? huge_page_shift(hstate_vma(vma))	\
+				: PAGE_SHIFT, true)
+
+void flush_tlb_local(void);
+extern void flush_tlb_all(void);
+extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
+				unsigned long end, unsigned int stride_shift,
+				bool freed_tables);
+extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+
+static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
+{
+	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
+}
+
 #endif /* _ASM_X86_TLBFLUSH_H */

-- 
2.51.2




* [PATCH v2 04/22] mm: Create flags arg for __apply_to_page_range()
  2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
                   ` (2 preceding siblings ...)
  2026-03-20 18:23 ` [PATCH v2 03/22] x86/tlb: Expose some flush function declarations to modules Brendan Jackman
@ 2026-03-20 18:23 ` Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 05/22] mm: Add more flags " Brendan Jackman
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

Preparatory patch, no functional change intended.

To prepare for making this function more generic, convert the boolean
"create" arg into a flags arg with a single flag that has the same
meaning.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 mm/internal.h | 10 ++++++++++
 mm/memory.c   | 29 +++++++++++++++++------------
 2 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index f98f4746ac412..4b389431b1639 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1870,4 +1870,14 @@ static inline int get_sysctl_max_map_count(void)
 	return READ_ONCE(sysctl_max_map_count);
 }
 
+/*
+ * Create a mapping if it doesn't exist. (Otherwise, skip regions with no
+ * existing mapping, and return an error for regions with no leaf pagetable).
+ */
+#define PGRANGE_CREATE		(1 << 0)
+
+int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
+			  unsigned long size, pte_fn_t fn,
+			  void *data, unsigned int flags);
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memory.c b/mm/memory.c
index 219b9bf6cae00..7e55014e5560b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3208,9 +3208,10 @@ EXPORT_SYMBOL(vm_iomap_memory);
 
 static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 				     unsigned long addr, unsigned long end,
-				     pte_fn_t fn, void *data, bool create,
+				     pte_fn_t fn, void *data, unsigned int flags,
 				     pgtbl_mod_mask *mask)
 {
+	bool create = flags & PGRANGE_CREATE;
 	pte_t *pte, *mapped_pte;
 	int err = 0;
 	spinlock_t *ptl;
@@ -3251,10 +3252,11 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 
 static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 				     unsigned long addr, unsigned long end,
-				     pte_fn_t fn, void *data, bool create,
+				     pte_fn_t fn, void *data, unsigned int flags,
 				     pgtbl_mod_mask *mask)
 {
 	pmd_t *pmd;
+	bool create = flags & PGRANGE_CREATE;
 	unsigned long next;
 	int err = 0;
 
@@ -3279,7 +3281,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 			pmd_clear_bad(pmd);
 		}
 		err = apply_to_pte_range(mm, pmd, addr, next,
-					 fn, data, create, mask);
+					 fn, data, flags, mask);
 		if (err)
 			break;
 	} while (pmd++, addr = next, addr != end);
@@ -3289,10 +3291,11 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 
 static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
 				     unsigned long addr, unsigned long end,
-				     pte_fn_t fn, void *data, bool create,
+				     pte_fn_t fn, void *data, unsigned int flags,
 				     pgtbl_mod_mask *mask)
 {
 	pud_t *pud;
+	bool create = flags & PGRANGE_CREATE;
 	unsigned long next;
 	int err = 0;
 
@@ -3325,10 +3328,11 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
 
 static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 				     unsigned long addr, unsigned long end,
-				     pte_fn_t fn, void *data, bool create,
+				     pte_fn_t fn, void *data, unsigned int flags,
 				     pgtbl_mod_mask *mask)
 {
 	p4d_t *p4d;
+	bool create = flags & PGRANGE_CREATE;
 	unsigned long next;
 	int err = 0;
 
@@ -3351,7 +3355,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 			p4d_clear_bad(p4d);
 		}
 		err = apply_to_pud_range(mm, p4d, addr, next,
-					 fn, data, create, mask);
+					 fn, data, flags, mask);
 		if (err)
 			break;
 	} while (p4d++, addr = next, addr != end);
@@ -3359,11 +3363,12 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 	return err;
 }
 
-static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
-				 unsigned long size, pte_fn_t fn,
-				 void *data, bool create)
+int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
+			  unsigned long size, pte_fn_t fn,
+			  void *data, unsigned int flags)
 {
 	pgd_t *pgd;
+	bool create = flags & PGRANGE_CREATE;
 	unsigned long start = addr, next;
 	unsigned long end = addr + size;
 	pgtbl_mod_mask mask = 0;
@@ -3387,7 +3392,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 			pgd_clear_bad(pgd);
 		}
 		err = apply_to_p4d_range(mm, pgd, addr, next,
-					 fn, data, create, &mask);
+					 fn, data, flags, &mask);
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
@@ -3405,7 +3410,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 			unsigned long size, pte_fn_t fn, void *data)
 {
-	return __apply_to_page_range(mm, addr, size, fn, data, true);
+	return __apply_to_page_range(mm, addr, size, fn, data, PGRANGE_CREATE);
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
 
@@ -3419,7 +3424,7 @@ EXPORT_SYMBOL_GPL(apply_to_page_range);
 int apply_to_existing_page_range(struct mm_struct *mm, unsigned long addr,
 				 unsigned long size, pte_fn_t fn, void *data)
 {
-	return __apply_to_page_range(mm, addr, size, fn, data, false);
+	return __apply_to_page_range(mm, addr, size, fn, data, 0);
 }
 
 /*

-- 
2.51.2




* [PATCH v2 05/22] mm: Add more flags for __apply_to_page_range()
  2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
                   ` (3 preceding siblings ...)
  2026-03-20 18:23 ` [PATCH v2 04/22] mm: Create flags arg for __apply_to_page_range() Brendan Jackman
@ 2026-03-20 18:23 ` Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 06/22] x86/mm: introduce the mermap Brendan Jackman
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

Add two flags to make this API more generic:

1. Separate "create" into two levels - one to allow creating new
   mappings without allocating pagetables, and one for the current
   behaviour that allows both of these.

2. Create a new flag to report that the caller has taken care of
   synchronization and no locks are required.

Both of these will serve to allow calling this API from restricted
contexts where allocation and pagetable locking are not possible.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 mm/internal.h | 19 ++++++++++++++++++-
 mm/memory.c   | 59 ++++++++++++++++++++++++++++++++++-------------------------
 2 files changed, 52 insertions(+), 26 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 4b389431b1639..f4c59534670e4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1872,9 +1872,26 @@ static inline int get_sysctl_max_map_count(void)
 
 /*
  * Create a mapping if it doesn't exist. (Otherwise, skip regions with no
- * existing mapping, and return an error for regions with no leaf pagetable).
+ * existing mapping). Most users will want PGRANGE_ALLOC or 0 instead.
  */
 #define PGRANGE_CREATE		(1 << 0)
+/*
+ * Allocate a pagetable if one is missing. (Otherwise, return an error for
+ * regions with no leaf pagetable). Also implies PGRANGE_CREATE.
+ */
+#define PGRANGE_ALLOC		(1 << 1)
+/*
+ * Do not take any locks. This means the caller has taken care of
+ * synchronisation. This is incompatible with PGRANGE_ALLOC and also with
+ * mm=&init_mm.
+ */
+#define PGRANGE_NOLOCK		(1 << 2)
+
+
+static inline bool pgrange_create(unsigned int flags)
+{
+	return flags & (PGRANGE_CREATE | PGRANGE_ALLOC);
+}
 
 int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 			  unsigned long size, pte_fn_t fn,
diff --git a/mm/memory.c b/mm/memory.c
index 7e55014e5560b..9f0ccbbbc4e59 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3211,30 +3211,36 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 				     pte_fn_t fn, void *data, unsigned int flags,
 				     pgtbl_mod_mask *mask)
 {
-	bool create = flags & PGRANGE_CREATE;
 	pte_t *pte, *mapped_pte;
 	int err = 0;
 	spinlock_t *ptl;
 
-	if (create) {
+	if (flags & PGRANGE_ALLOC) {
+		VM_WARN_ON(flags & PGRANGE_NOLOCK);
+
 		mapped_pte = pte = (mm == &init_mm) ?
 			pte_alloc_kernel_track(pmd, addr, mask) :
 			pte_alloc_map_lock(mm, pmd, addr, &ptl);
+
 		if (!pte)
 			return -ENOMEM;
 	} else {
-		mapped_pte = pte = (mm == &init_mm) ?
-			pte_offset_kernel(pmd, addr) :
-			pte_offset_map_lock(mm, pmd, addr, &ptl);
+		if (mm == &init_mm)
+			pte = pte_offset_kernel(pmd, addr);
+		else if (flags & PGRANGE_NOLOCK)
+			pte = pte_offset_map(pmd, addr);
+		else
+			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 		if (!pte)
 			return -EINVAL;
+		mapped_pte = pte;
 	}
 
 	lazy_mmu_mode_enable();
 
 	if (fn) {
 		do {
-			if (create || !pte_none(ptep_get(pte))) {
+			if (pgrange_create(flags) || !pte_none(ptep_get(pte))) {
 				err = fn(pte, addr, data);
 				if (err)
 					break;
@@ -3245,8 +3251,12 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 
 	lazy_mmu_mode_disable();
 
-	if (mm != &init_mm)
-		pte_unmap_unlock(mapped_pte, ptl);
+	if (mm != &init_mm) {
+		if (flags & PGRANGE_NOLOCK)
+			pte_unmap(mapped_pte);
+		else
+			pte_unmap_unlock(mapped_pte, ptl);
+	}
 	return err;
 }
 
@@ -3256,13 +3266,12 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 				     pgtbl_mod_mask *mask)
 {
 	pmd_t *pmd;
-	bool create = flags & PGRANGE_CREATE;
 	unsigned long next;
 	int err = 0;
 
 	BUG_ON(pud_leaf(*pud));
 
-	if (create) {
+	if (pgrange_create(flags)) {
 		pmd = pmd_alloc_track(mm, pud, addr, mask);
 		if (!pmd)
 			return -ENOMEM;
@@ -3271,12 +3280,12 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 	}
 	do {
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(*pmd) && !create)
+		if (pmd_none(*pmd) && !pgrange_create(flags))
 			continue;
 		if (WARN_ON_ONCE(pmd_leaf(*pmd)))
 			return -EINVAL;
 		if (!pmd_none(*pmd) && WARN_ON_ONCE(pmd_bad(*pmd))) {
-			if (!create)
+			if (!pgrange_create(flags))
 				continue;
 			pmd_clear_bad(pmd);
 		}
@@ -3295,11 +3304,10 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
 				     pgtbl_mod_mask *mask)
 {
 	pud_t *pud;
-	bool create = flags & PGRANGE_CREATE;
 	unsigned long next;
 	int err = 0;
 
-	if (create) {
+	if (pgrange_create(flags)) {
 		pud = pud_alloc_track(mm, p4d, addr, mask);
 		if (!pud)
 			return -ENOMEM;
@@ -3308,17 +3316,17 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
 	}
 	do {
 		next = pud_addr_end(addr, end);
-		if (pud_none(*pud) && !create)
+		if (pud_none(*pud) && !pgrange_create(flags))
 			continue;
 		if (WARN_ON_ONCE(pud_leaf(*pud)))
 			return -EINVAL;
 		if (!pud_none(*pud) && WARN_ON_ONCE(pud_bad(*pud))) {
-			if (!create)
+			if (!pgrange_create(flags))
 				continue;
 			pud_clear_bad(pud);
 		}
 		err = apply_to_pmd_range(mm, pud, addr, next,
-					 fn, data, create, mask);
+					 fn, data, flags, mask);
 		if (err)
 			break;
 	} while (pud++, addr = next, addr != end);
@@ -3332,11 +3340,10 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 				     pgtbl_mod_mask *mask)
 {
 	p4d_t *p4d;
-	bool create = flags & PGRANGE_CREATE;
 	unsigned long next;
 	int err = 0;
 
-	if (create) {
+	if (pgrange_create(flags)) {
 		p4d = p4d_alloc_track(mm, pgd, addr, mask);
 		if (!p4d)
 			return -ENOMEM;
@@ -3345,12 +3352,12 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 	}
 	do {
 		next = p4d_addr_end(addr, end);
-		if (p4d_none(*p4d) && !create)
+		if (p4d_none(*p4d) && !pgrange_create(flags))
 			continue;
 		if (WARN_ON_ONCE(p4d_leaf(*p4d)))
 			return -EINVAL;
 		if (!p4d_none(*p4d) && WARN_ON_ONCE(p4d_bad(*p4d))) {
-			if (!create)
+			if (!pgrange_create(flags))
 				continue;
 			p4d_clear_bad(p4d);
 		}
@@ -3368,7 +3375,6 @@ int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 			  void *data, unsigned int flags)
 {
 	pgd_t *pgd;
-	bool create = flags & PGRANGE_CREATE;
 	unsigned long start = addr, next;
 	unsigned long end = addr + size;
 	pgtbl_mod_mask mask = 0;
@@ -3376,18 +3382,21 @@ int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 
 	if (WARN_ON(addr >= end))
 		return -EINVAL;
+	if (WARN_ON(flags & PGRANGE_NOLOCK &&
+		    (mm == &init_mm || flags & PGRANGE_ALLOC)))
+		return -EINVAL;
 
 	pgd = pgd_offset(mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
-		if (pgd_none(*pgd) && !create)
+		if (pgd_none(*pgd) && !pgrange_create(flags))
 			continue;
 		if (WARN_ON_ONCE(pgd_leaf(*pgd))) {
 			err = -EINVAL;
 			break;
 		}
 		if (!pgd_none(*pgd) && WARN_ON_ONCE(pgd_bad(*pgd))) {
-			if (!create)
+			if (!pgrange_create(flags))
 				continue;
 			pgd_clear_bad(pgd);
 		}
@@ -3410,7 +3419,7 @@ int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 			unsigned long size, pte_fn_t fn, void *data)
 {
-	return __apply_to_page_range(mm, addr, size, fn, data, PGRANGE_CREATE);
+	return __apply_to_page_range(mm, addr, size, fn, data, PGRANGE_ALLOC);
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
 

-- 
2.51.2



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 06/22] x86/mm: introduce the mermap
  2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
                   ` (4 preceding siblings ...)
  2026-03-20 18:23 ` [PATCH v2 05/22] mm: Add more flags " Brendan Jackman
@ 2026-03-20 18:23 ` Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 07/22] mm: KUnit tests for " Brendan Jackman
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

The mermap provides a fast way to create ephemeral mm-local mappings of
physical pages. The purpose of this is to access pages that have been
removed from the direct map. Potential use cases are:

1. For zeroing __GFP_UNMAPPED pages (added in a later patch).

2. For populating guest_memfd pages that are protected by the
   GUEST_MEMFD_NO_DIRECT_MAP feature [0].

3. For efficient access of pages protected by Address Space Isolation
   [1].

[0] https://lore.kernel.org/all/20250924151101.2225820-1-patrick.roy@campus.lmu.de/
[1] https://linuxasi.dev

The details of this mechanism are described in the API comments. However,
the key idea is to use CPU-local virtual regions to avoid the need for
synchronization. On x86, this can also be used to avoid TLB shootdowns.

Because the virtual region is CPU-local, allocating from the mermap
disables migration. The caller is forbidden from using the returned
mapping from any other context, and migration is re-enabled when the
mapping is freed.

One might notice that mermap_get() bears a strong similarity to
kmap_local_page(). The most important differences between mermap_get()
and kmap_local_page() are:

1. mermap_get() allows mapping variable sizes while kmap_local_page()
   specifically maps a single order-0 page.
2. As a consequence of 1 (combined with the need for mermap_get() to be
   an extremely simple allocator), mermap_get() should be expected to
   fail, while kmap_local_page() is guaranteed to work up to a certain
   degree of nesting.
3. While the mappings provided by kmap_local_page() are _logically_
   local to the calling context (it's a bug for software to access them
   from elsewhere), they are _physically_ installed into the shared
   kernel pagetables. This means their locality doesn't provide any
   protection from hardware attacks. In contrast, the mermap is
   physically local to the creating mm, taking advantage of the new
   mm-local kernel address region.

So that the mermap is available even in contexts where failure cannot be
tolerated, there is also a _reserved() variant, which is fixed at
allocating a single base page. This is useful, for example, for zeroing
__GFP_UNMAPPED pages, where handling failure would be extremely
inconvenient. The _reserved() variant is implemented simply by keeping
one base-page slot unavailable to non-_reserved allocations, and by
requiring an atomic context.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 arch/x86/Kconfig                        |   1 +
 arch/x86/include/asm/mermap.h           |  23 +++
 arch/x86/include/asm/pgtable_64_types.h |   8 +-
 include/linux/mermap.h                  |  63 ++++++
 include/linux/mermap_types.h            |  41 ++++
 include/linux/mm_types.h                |   4 +
 kernel/fork.c                           |   5 +
 mm/Kconfig                              |  10 +
 mm/Makefile                             |   1 +
 mm/mermap.c                             | 334 ++++++++++++++++++++++++++++++++
 mm/pgalloc-track.h                      |   6 +
 11 files changed, 495 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d7073b6077c62..f093252b5eab5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -37,6 +37,7 @@ config X86_64
 	select ZONE_DMA32
 	select EXECMEM if DYNAMIC_FTRACE
 	select ACPI_MRRM if ACPI
+	select ARCH_SUPPORTS_MERMAP
 
 config FORCE_DYNAMIC_FTRACE
 	def_bool y
diff --git a/arch/x86/include/asm/mermap.h b/arch/x86/include/asm/mermap.h
new file mode 100644
index 0000000000000..9d7614716b718
--- /dev/null
+++ b/arch/x86/include/asm/mermap.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_MERMAP_H
+#define _ASM_X86_MERMAP_H
+
+#include <asm/tlbflush.h>
+
+static inline void arch_mermap_flush_tlb(void)
+{
+	/*
+	 * No shootdown allowed, IRQs may be off. Luckily other CPUs are not
+	 * allowed to access our region so the stale mappings are harmless, as
+	 * long as they still point to data belonging to this process.
+	 */
+	__flush_tlb_all();
+}
+
+static inline bool arch_mermap_pgprot_allowed(pgprot_t prot)
+{
+	/* Mermap is mm-local so global mappings would be a bug. */
+	return !(pgprot_val(prot) & _PAGE_GLOBAL);
+}
+
+#endif /* _ASM_X86_MERMAP_H */
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 1181565966405..fb6c3daacfeb8 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -105,11 +105,17 @@ extern unsigned int ptrs_per_p4d;
 
 #define MM_LOCAL_PGD_ENTRY	-240UL
 #define MM_LOCAL_BASE_ADDR	(MM_LOCAL_PGD_ENTRY << PGDIR_SHIFT)
-#define MM_LOCAL_END_ADDR	((MM_LOCAL_PGD_ENTRY + 1) << PGDIR_SHIFT)
+#define MM_LOCAL_START_ADDR	((MM_LOCAL_PGD_ENTRY) << PGDIR_SHIFT)
+#define MM_LOCAL_END_ADDR	(MM_LOCAL_START_ADDR + (1UL << PGDIR_SHIFT))
 
 #define LDT_BASE_ADDR		MM_LOCAL_BASE_ADDR
 #define LDT_END_ADDR		(LDT_BASE_ADDR + PMD_SIZE)
 
+#define MERMAP_BASE_ADDR	LDT_END_ADDR
+#define MERMAP_CPU_REGION_SIZE	PMD_SIZE
+#define MERMAP_SIZE		(MERMAP_CPU_REGION_SIZE * NR_CPUS)
+#define MERMAP_END_ADDR		(MERMAP_BASE_ADDR + MERMAP_SIZE)
+
 #define __VMALLOC_BASE_L4	0xffffc90000000000UL
 #define __VMALLOC_BASE_L5 	0xffa0000000000000UL
 
diff --git a/include/linux/mermap.h b/include/linux/mermap.h
new file mode 100644
index 0000000000000..5457dcb8c9789
--- /dev/null
+++ b/include/linux/mermap.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MERMAP_H
+#define _LINUX_MERMAP_H
+
+#include <linux/mermap_types.h>
+#include <linux/mm.h>
+
+#ifdef CONFIG_MERMAP
+
+#include <asm/mermap.h>
+
+int mermap_mm_prepare(struct mm_struct *mm);
+void mermap_mm_init(struct mm_struct *mm);
+void mermap_mm_teardown(struct mm_struct *mm);
+
+/* Can the mermap be called from this context? */
+static inline bool mermap_ready(void)
+{
+	return in_task() && current->mm && current->mm->mermap.cpu;
+}
+
+struct mermap_alloc *mermap_get(struct page *page, unsigned long size, pgprot_t prot);
+void *mermap_get_reserved(struct page *page, pgprot_t prot);
+void mermap_put(struct mermap_alloc *alloc);
+
+static inline void *mermap_addr(struct mermap_alloc *alloc)
+{
+	return (void *)alloc->base;
+}
+
+/*
+ * arch_mermap_flush_tlb() is called before a part of the local CPU's mermap
+ * region is remapped to a new address. No other CPU is allowed to _access_ that
+ * region, but stale TLB entries for it may still exist on other CPUs.
+ *
+ * This may be called with IRQs off.
+ *
+ * On arm64, this will need to be a broadcast TLB flush. Although the other CPUs
+ * are forbidden to access the region, they can leak the data that was mapped
+ * there via CPU exploits. Violating break-before-make would mean the data
+ * available to these CPU exploits is unpredictable.
+ */
+extern void arch_mermap_flush_tlb(void);
+extern bool arch_mermap_pgprot_allowed(pgprot_t prot);
+
+#if IS_ENABLED(CONFIG_KUNIT)
+struct mermap_alloc *__mermap_get(struct mm_struct *mm, struct page *page,
+			unsigned long size, pgprot_t prot, bool use_reserve);
+void __mermap_put(struct mm_struct *mm, struct mermap_alloc *alloc);
+unsigned long mermap_cpu_base(int cpu);
+unsigned long mermap_cpu_end(int cpu);
+#endif
+
+#else /* CONFIG_MERMAP */
+
+static inline int mermap_mm_prepare(struct mm_struct *mm) { return 0; }
+static inline void mermap_mm_init(struct mm_struct *mm) { }
+static inline void mermap_mm_teardown(struct mm_struct *mm) { }
+static inline bool mermap_ready(void) { return false; }
+
+#endif /* CONFIG_MERMAP */
+
+#endif /* _LINUX_MERMAP_H */
diff --git a/include/linux/mermap_types.h b/include/linux/mermap_types.h
new file mode 100644
index 0000000000000..c1c83b223c28d
--- /dev/null
+++ b/include/linux/mermap_types.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MERMAP_TYPES_H
+#define _LINUX_MERMAP_TYPES_H
+
+#include <linux/mutex.h>
+#include <linux/percpu.h>
+#include <linux/types.h>
+
+#ifdef CONFIG_MERMAP
+
+/* Tracks an individual allocation in the mermap. */
+struct mermap_alloc {
+	/* Currently allocated. */
+	bool in_use;
+	/* Requires flush before reallocating. */
+	bool need_flush;
+	unsigned long base;
+	/* Non-inclusive. */
+	unsigned long end;
+};
+
+struct mermap_cpu {
+	/* Next address immediately available for alloc (no TLB flush needed). */
+	unsigned long next_addr;
+	struct mermap_alloc normal_allocs[3];
+	struct mermap_alloc reserve_alloc;
+};
+
+struct mermap {
+	struct mutex init_lock;
+	struct mermap_cpu __percpu *cpu;
+};
+
+#else /* CONFIG_MERMAP */
+
+struct mermap {};
+
+#endif /* CONFIG_MERMAP */
+
+#endif /* _LINUX_MERMAP_TYPES_H */
+
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0ca7cb7da918f..2c60a451f96e7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -7,6 +7,7 @@
 #include <linux/auxvec.h>
 #include <linux/kref.h>
 #include <linux/list.h>
+#include <linux/mermap_types.h>
 #include <linux/spinlock.h>
 #include <linux/rbtree.h>
 #include <linux/maple_tree.h>
@@ -34,6 +35,7 @@
 struct address_space;
 struct futex_private_hash;
 struct mem_cgroup;
+struct mermap;
 
 typedef struct {
 	unsigned long f;
@@ -1172,6 +1174,8 @@ struct mm_struct {
 		atomic_t membarrier_state;
 #endif
 
+		struct mermap mermap;
+
 		/**
 		 * @mm_users: The number of users including userspace.
 		 *
diff --git a/kernel/fork.c b/kernel/fork.c
index ff075c74333fe..2770b4d296846 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -13,6 +13,7 @@
  */
 
 #include <linux/anon_inodes.h>
+#include <linux/mermap.h>
 #include <linux/slab.h>
 #include <linux/sched/autogroup.h>
 #include <linux/sched/mm.h>
@@ -1144,6 +1145,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 
 	mm->user_ns = get_user_ns(user_ns);
 	lru_gen_init_mm(mm);
+
+	mermap_mm_init(mm);
+
 	return mm;
 
 fail_pcpu:
@@ -1187,6 +1191,7 @@ static inline void __mmput(struct mm_struct *mm)
 	ksm_exit(mm);
 	khugepaged_exit(mm); /* must run before exit_mmap */
 	exit_mmap(mm);
+	mermap_mm_teardown(mm);
 	mm_put_huge_zero_folio(mm);
 	set_mm_exe_file(mm, NULL);
 	if (!list_empty(&mm->mmlist)) {
diff --git a/mm/Kconfig b/mm/Kconfig
index 2813059df9c1c..2bf1dbcc8cb10 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1484,4 +1484,14 @@ config MM_LOCAL_REGION
 
 source "mm/damon/Kconfig"
 
+config ARCH_SUPPORTS_MERMAP
+	bool
+
+config MERMAP
+	bool "Support for epheMERal mappings within the kernel"
+	depends on ARCH_SUPPORTS_MERMAP
+	depends on MM_LOCAL_REGION
+	help
+	  Fast mm-local mappings of pages removed from the direct map.
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index ffd06cf7a04e6..0c45677f4a538 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -150,3 +150,4 @@ obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
 obj-$(CONFIG_LAZY_MMU_MODE_KUNIT_TEST) += tests/lazy_mmu_mode_kunit.o
+obj-$(CONFIG_MERMAP) += mermap.o
diff --git a/mm/mermap.c b/mm/mermap.c
new file mode 100644
index 0000000000000..7cddc202755ee
--- /dev/null
+++ b/mm/mermap.c
@@ -0,0 +1,334 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/io.h>
+#include <linux/error-injection.h>
+#include <linux/mermap.h>
+#include <linux/mm.h>
+#include <linux/mmu_context.h>
+#include <linux/mutex.h>
+#include <linux/pagemap.h>
+#include <linux/pgtable.h>
+#include <linux/sched.h>
+
+#include <kunit/visibility.h>
+
+#include "internal.h"
+
+static inline int set_unmapped_pte(pte_t *ptep, unsigned long addr, void *data)
+{
+	set_pte(ptep, __pte(0));
+	return 0;
+}
+
+VISIBLE_IF_KUNIT void __mermap_put(struct mm_struct *mm, struct mermap_alloc *alloc)
+{
+	unsigned long size = PAGE_ALIGN(alloc->end - alloc->base);
+
+	if (WARN_ON_ONCE(!alloc->in_use))
+		return;
+
+	__apply_to_page_range(mm, alloc->base, size, set_unmapped_pte,
+			      NULL, PGRANGE_CREATE | PGRANGE_NOLOCK);
+
+	WRITE_ONCE(alloc->in_use, false);
+
+	migrate_enable();
+}
+EXPORT_SYMBOL_IF_KUNIT(__mermap_put);
+
+/* Release a region allocated by mermap_get(). */
+void mermap_put(struct mermap_alloc *alloc)
+{
+	__mermap_put(current->mm, alloc);
+}
+EXPORT_SYMBOL(mermap_put);
+
+VISIBLE_IF_KUNIT inline unsigned long mermap_cpu_base(int cpu)
+{
+	return MERMAP_BASE_ADDR + (cpu * MERMAP_CPU_REGION_SIZE);
+
+}
+EXPORT_SYMBOL_IF_KUNIT(mermap_cpu_base);
+
+/* Non-inclusive */
+VISIBLE_IF_KUNIT inline unsigned long mermap_cpu_end(int cpu)
+{
+	return MERMAP_BASE_ADDR + ((cpu + 1) * MERMAP_CPU_REGION_SIZE);
+
+}
+EXPORT_SYMBOL_IF_KUNIT(mermap_cpu_end);
+
+static inline void mermap_flush_tlb(int cpu, struct mermap_cpu *mc)
+{
+#if IS_ENABLED(CONFIG_MERMAP_KUNIT_TEST)
+	mc->tlb_flushes++;
+#endif
+	arch_mermap_flush_tlb();
+}
+
+/* Call with preemption disabled if use_reserve, else with migration disabled. */
+static inline struct mermap_alloc *mermap_alloc(struct mm_struct *mm,
+						unsigned long size, bool use_reserve)
+{
+	int cpu = raw_smp_processor_id();
+	struct mermap_cpu *mc = this_cpu_ptr(mm->mermap.cpu);
+	unsigned long cpu_end = mermap_cpu_end(cpu);
+	struct mermap_alloc *alloc = NULL;
+
+	/*
+	 * This is an extremely stupid allocator, there can only ever be a small
+	 * number of allocations so everything just works on linear search.
+	 *
+	 * Allocations are "in order", i.e. if the whole region is free it
+	 * allocates from the beginning. If there are any existing allocations
+	 * it allocates from right after the last (highest address) one. Any
+	 * free space before that goes unused.
+	 *
+	 * Once an allocation has been freed, the space it occupied must be flushed
+	 * from the TLB before it can be reused.
+	 *
+	 * Visual example of how this is supposed to behave (A for allocated, T
+	 * for TLB-flush-pending):
+	 *
+	 *  _______________ Start with everything free.
+	 *  AaaA___________ Allocate something.
+	 *  TttT___________ Free it. (Region needs a TLB flush now).
+	 *  TttTAaaaaaaaA__ Allocate something else.
+	 *  TttTAaaaaaaaAAA Allocate the remaining space.
+	 *  TttTTtttttttTAA Free the allocation before last.
+	 *  ^^^^^^^^^^^^^   This could all be reused now but for simplicity it
+	 *                  isn't. Another allocation at this point will fail.
+	 *  TttTTtttttttTTT Free the last allocation.
+	 *  _______________ Next time we allocate, first flush the TLB.
+	 *  AA_____________ Now we're back at the beginning.
+	 */
+
+	/* Keep one page for mermap_get_reserved(). */
+	if (use_reserve) {
+		if (WARN_ON_ONCE(size != PAGE_SIZE))
+			return NULL;
+		lockdep_assert_preemption_disabled();
+	} else {
+		cpu_end -= PAGE_SIZE;
+	}
+
+	if (WARN_ON_ONCE(!in_task()))
+		return NULL;
+	guard(preempt)();
+
+	/* Out of already-available space? */
+	if (mc->next_addr + size > cpu_end) {
+		unsigned long new_next = mermap_cpu_base(cpu);
+
+		/* Would we have space after a TLB flush? */
+		for (int i = 0; i < ARRAY_SIZE(mc->normal_allocs); i++) {
+			struct mermap_alloc *alloc = &mc->normal_allocs[i];
+
+			/*
+			 * The space between the uppermost allocated alloc->end
+			 * (or the base of the CPU's region if there are no
+			 * current allocations) and mc->next_addr has been
+			 * unmapped in the pagetables, but not flushed from the
+			 * TLB. Set new_next to point to the beginning of that
+			 * space.
+			 */
+			if (READ_ONCE(alloc->in_use))
+				new_next = max(new_next, alloc->end);
+		}
+		if (size > cpu_end - new_next)
+			return NULL;
+
+		mermap_flush_tlb(cpu, mc);
+		mc->next_addr = new_next;
+	}
+
+	/*
+	 * Find an alloc-tracking structure to use. Keep one for
+	 * mermap_get_reserved() - that should never be contended since it can
+	 * only be allocated with preemption off.
+	 */
+	if (WARN_ON_ONCE(mc->reserve_alloc.in_use))
+		return NULL;
+	if (use_reserve) {
+		alloc = &mc->reserve_alloc;
+	} else {
+		for (int i = 0; i < ARRAY_SIZE(mc->normal_allocs); i++) {
+			if (!READ_ONCE(mc->normal_allocs[i].in_use)) {
+				alloc = &mc->normal_allocs[i];
+				break;
+			}
+		}
+		if (!alloc)
+			return NULL;
+	}
+	alloc->in_use = true;
+	alloc->base = mc->next_addr;
+	alloc->end = alloc->base + size;
+	mc->next_addr += size;
+
+	return alloc;
+}
+
+struct set_pte_ctx {
+	pgprot_t prot;
+	unsigned long next_pfn;
+};
+
+static inline int do_set_pte(pte_t *pte, unsigned long addr, void *data)
+{
+	struct set_pte_ctx *ctx = data;
+
+	set_pte(pte, pfn_pte(ctx->next_pfn, ctx->prot));
+	ctx->next_pfn++;
+
+	return 0;
+}
+
+VISIBLE_IF_KUNIT struct mermap_alloc *
+__mermap_get(struct mm_struct *mm, struct page *page,
+	     unsigned long size, pgprot_t prot, bool use_reserve)
+{
+	struct mermap_alloc *alloc = NULL;
+	struct set_pte_ctx ctx;
+	int err;
+
+	if (size > MERMAP_CPU_REGION_SIZE || WARN_ON_ONCE(!mm || !mm->mermap.cpu))
+		return NULL;
+	if (WARN_ON_ONCE(!arch_mermap_pgprot_allowed(prot)))
+		return NULL;
+
+	size = PAGE_ALIGN(size);
+
+	migrate_disable();
+
+	alloc = mermap_alloc(mm, size, use_reserve);
+	if (!alloc) {
+		migrate_enable();
+		return NULL;
+	}
+
+	/* This probably wants to be optimised. */
+	ctx.prot = prot;
+	ctx.next_pfn = page_to_pfn(page);
+	err = __apply_to_page_range(mm, alloc->base, size, do_set_pte, &ctx,
+				    PGRANGE_CREATE | PGRANGE_NOLOCK);
+	if (err) {
+		__mermap_put(mm, alloc);
+		return NULL;
+	}
+
+	return alloc;
+}
+EXPORT_SYMBOL_IF_KUNIT(__mermap_get);
+
+/*
+ * Allocate a region of virtual memory, and map the page into it. This tries
+ * pretty hard to be fast but doesn't try very hard at all to actually succeed.
+ *
+ * The returned region is physically local to the current mm. It is _logically_
+ * local to the current CPU, but this is not enforced by hardware, so its
+ * contents can still be leaked via CPU exploits. The caller must not map memory
+ * here that doesn't belong to the current process. The caller must also perform
+ * a full TLB flush of the region before freeing the pages that have been mapped
+ * here.
+ *
+ * This may only be called from process context, and the caller must arrange to
+ * first call mermap_mm_prepare(). (It would be possible to support this in IRQ,
+ * but it seems unlikely there's a valid use case given the TLB flushing
+ * requirements). If it succeeds, it disables migration until you call
+ * mermap_put().
+ *
+ * This is guaranteed not to allocate.
+ *
+ * Use mermap_addr() to get the actual address of the mapped region.
+ */
+struct mermap_alloc *mermap_get(struct page *page, unsigned long size, pgprot_t prot)
+{
+	return __mermap_get(current->mm, page, size, prot, false);
+}
+EXPORT_SYMBOL(mermap_get);
+ALLOW_ERROR_INJECTION(mermap_get, NULL);
+
+/*
+ * Allocate a single PAGE_SIZE page via mermap_get(), requiring preemption to be
+ * off until it is freed. This always succeeds.
+ */
+void *mermap_get_reserved(struct page *page, pgprot_t prot)
+{
+	lockdep_assert_preemption_disabled();
+	return __mermap_get(current->mm, page, PAGE_SIZE, prot, true);
+}
+EXPORT_SYMBOL(mermap_get_reserved);
+
+/*
+ * Internal - do unconditional (cheap) setup that's done for every mm. This
+ * doesn't actually prepare the mermap for use until someone calls
+ * mermap_mm_prepare().
+ */
+void mermap_mm_init(struct mm_struct *mm)
+{
+	mutex_init(&mm->mermap.init_lock);
+}
+
+/*
+ * Set up the mermap for this mm. The caller doesn't need to call
+ * mermap_mm_teardown(), that's taken care of by the normal mm teardown
+ * mechanism. This is idempotent and thread-safe.
+ */
+int mermap_mm_prepare(struct mm_struct *mm)
+{
+	int err = 0;
+	int cpu;
+
+	guard(mutex)(&mm->mermap.init_lock);
+
+	/* Already done? */
+	if (likely(mm->mermap.cpu))
+		return 0;
+
+	mm->mermap.cpu = alloc_percpu_gfp(struct mermap_cpu,
+					  GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+	if (!mm->mermap.cpu)
+		return -ENOMEM;
+
+	/* So we can use this from the page allocator, preallocate pagetables. */
+	mm_flags_set(MMF_LOCAL_REGION_USED, mm);
+	for_each_possible_cpu(cpu) {
+		unsigned long base = mermap_cpu_base(cpu);
+
+		/* Note this pointlessly iterates over PTEs to initialise. */
+		err = apply_to_page_range(mm, base, MERMAP_CPU_REGION_SIZE,
+					  set_unmapped_pte, NULL);
+		if (err) {
+			/*
+			 * Clear .cpu now to inform mermap_ready(). Any partial
+			 * page tables get cleared up by mm teardown.
+			 */
+			free_percpu(mm->mermap.cpu);
+			mm->mermap.cpu = NULL;
+			break;
+		}
+		per_cpu_ptr(mm->mermap.cpu, cpu)->next_addr = base;
+	}
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(mermap_mm_prepare);
+
+/* Clean up mermap stuff on mm teardown. */
+void mermap_mm_teardown(struct mm_struct *mm)
+{
+	int cpu;
+
+	if (!mm->mermap.cpu)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		struct mermap_cpu *mc = per_cpu_ptr(mm->mermap.cpu, cpu);
+
+		for (int i = 0; i < ARRAY_SIZE(mc->normal_allocs); i++)
+			WARN_ON_ONCE(mc->normal_allocs[i].in_use);
+		WARN_ON_ONCE(mc->reserve_alloc.in_use);
+	}
+
+	free_percpu(mm->mermap.cpu);
+}
diff --git a/mm/pgalloc-track.h b/mm/pgalloc-track.h
index e9e879de8649b..51fc4668d7177 100644
--- a/mm/pgalloc-track.h
+++ b/mm/pgalloc-track.h
@@ -2,6 +2,12 @@
 #ifndef _LINUX_PGALLOC_TRACK_H
 #define _LINUX_PGALLOC_TRACK_H
 
+#include <linux/mm.h>
+#include <linux/pgalloc.h>
+#include <linux/pgtable.h>
+
+#include "internal.h"
+
 #if defined(CONFIG_MMU)
 static inline p4d_t *p4d_alloc_track(struct mm_struct *mm, pgd_t *pgd,
 				     unsigned long address,

-- 
2.51.2



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 07/22] mm: KUnit tests for the mermap
  2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
                   ` (5 preceding siblings ...)
  2026-03-20 18:23 ` [PATCH v2 06/22] x86/mm: introduce the mermap Brendan Jackman
@ 2026-03-20 18:23 ` Brendan Jackman
  2026-03-24  8:00   ` kernel test robot
  2026-03-20 18:23 ` [PATCH v2 08/22] mm: introduce for_each_free_list() Brendan Jackman
                   ` (14 subsequent siblings)
  21 siblings, 1 reply; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

Some simple smoke-tests for the mermap. Mainly aiming to test:

1. That there aren't any silly off-by-ones.

2. That the pagetables are not completely broken.

3. That the TLB appears to get flushed basically when expected.

This last point requires a bit of ifdeffery to detect when the flushing
has been performed.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 include/linux/mermap_types.h |   3 +
 mm/Kconfig                   |  11 ++
 mm/Makefile                  |   1 +
 mm/tests/mermap_kunit.c      | 250 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 265 insertions(+)

diff --git a/include/linux/mermap_types.h b/include/linux/mermap_types.h
index c1c83b223c28d..13110fcb4c387 100644
--- a/include/linux/mermap_types.h
+++ b/include/linux/mermap_types.h
@@ -24,6 +24,9 @@ struct mermap_cpu {
 	unsigned long next_addr;
 	struct mermap_alloc normal_allocs[3];
 	struct mermap_alloc reserve_alloc;
+#if IS_ENABLED(CONFIG_MERMAP_KUNIT_TEST)
+	u64 tlb_flushes;
+#endif
 };
 
 struct mermap {
diff --git a/mm/Kconfig b/mm/Kconfig
index 2bf1dbcc8cb10..e98db58d515fc 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1494,4 +1494,15 @@ config MERMAP
 	help
 	  Support for epheMERal mappings within the kernel.
 
+config MERMAP_KUNIT_TEST
+	tristate "KUnit tests for the mermap" if !KUNIT_ALL_TESTS
+	depends on ARCH_SUPPORTS_MERMAP
+	depends on KUNIT
+	depends on MERMAP
+	default KUNIT_ALL_TESTS
+	help
+	  KUnit test for the mermap.
+
+	  If unsure, say N.
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 0c45677f4a538..93a1756303cf9 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -151,3 +151,4 @@ obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
 obj-$(CONFIG_LAZY_MMU_MODE_KUNIT_TEST) += tests/lazy_mmu_mode_kunit.o
 obj-$(CONFIG_MERMAP) += mermap.o
+obj-$(CONFIG_MERMAP_KUNIT_TEST) += tests/mermap_kunit.o
diff --git a/mm/tests/mermap_kunit.c b/mm/tests/mermap_kunit.c
new file mode 100644
index 0000000000000..4ac6bce2d75f7
--- /dev/null
+++ b/mm/tests/mermap_kunit.c
@@ -0,0 +1,250 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/cacheflush.h>
+#include <linux/kthread.h>
+#include <linux/mermap.h>
+#include <linux/pgtable.h>
+
+#include <kunit/test.h>
+
+#define NR_NORMAL_ALLOCS ARRAY_SIZE(((struct mm_struct *)NULL)->mermap.cpu->normal_allocs)
+
+KUNIT_DEFINE_ACTION_WRAPPER(__free_page_wrapper, __free_page, struct page *);
+
+static inline struct page *alloc_page_wrapper(struct kunit *test, gfp_t gfp)
+{
+	struct page *page = alloc_page(gfp);
+
+	KUNIT_ASSERT_NOT_NULL(test, page);
+	KUNIT_ASSERT_EQ(test, kunit_add_action_or_reset(test, __free_page_wrapper, page), 0);
+	return page;
+}
+
+KUNIT_DEFINE_ACTION_WRAPPER(mmput_wrapper, mmput, struct mm_struct *);
+
+static inline struct mm_struct *mm_alloc_wrapper(struct kunit *test)
+{
+	struct mm_struct *mm = mm_alloc();
+
+	KUNIT_ASSERT_NOT_NULL(test, mm);
+	KUNIT_ASSERT_EQ(test, kunit_add_action_or_reset(test, mmput_wrapper, mm), 0);
+	return mm;
+}
+
+static inline struct mm_struct *get_mm(struct kunit *test)
+{
+	struct mm_struct *mm = mm_alloc_wrapper(test);
+
+	KUNIT_ASSERT_EQ(test, mermap_mm_prepare(mm), 0);
+	return mm;
+}
+
+struct __mermap_put_args {
+	struct mm_struct *mm;
+	struct mermap_alloc *alloc;
+	unsigned long size;
+};
+
+static inline void __mermap_put_wrapper(void *ctx)
+{
+	struct __mermap_put_args *args = (struct __mermap_put_args *)ctx;
+
+	__mermap_put(args->mm, args->alloc);
+}
+
+/* Call __mermap_get() with use_reserve=false, deal with cleanup. */
+static inline struct __mermap_put_args *
+__mermap_get_wrapper(struct kunit *test, struct mm_struct *mm,
+		     struct page *page, unsigned long size, pgprot_t prot)
+{
+	struct __mermap_put_args *args =
+		kunit_kmalloc(test, sizeof(struct __mermap_put_args), GFP_KERNEL);
+
+	KUNIT_ASSERT_NOT_NULL(test, args);
+	args->mm = mm;
+	args->alloc = __mermap_get(mm, page, size, prot, false);
+	args->size = size;
+
+	if (args->alloc) {
+		int err = kunit_add_action_or_reset(test, __mermap_put_wrapper, args);
+
+		KUNIT_ASSERT_EQ(test, err, 0);
+	}
+
+	return args;
+}
+
+/* Do the cleanup from __mermap_get_wrapper, now. */
+static inline void __mermap_put_early(struct kunit *test, struct __mermap_put_args *args)
+{
+	kunit_release_action(test, __mermap_put_wrapper, args);
+}
+
+static void test_basic_alloc(struct kunit *test)
+{
+	struct page *page = alloc_page_wrapper(test, GFP_KERNEL);
+	struct mm_struct *mm = get_mm(test);
+	struct __mermap_put_args *args;
+
+	args = __mermap_get_wrapper(test, mm, page, PAGE_SIZE, PAGE_KERNEL);
+	KUNIT_ASSERT_NOT_NULL(test, args->alloc);
+}
+
+/* Dumb check for off-by-ones. */
+static void test_size(struct kunit *test)
+{
+	struct page *page = alloc_page_wrapper(test, GFP_KERNEL);
+	struct __mermap_put_args *full, *large, *small, *fail;
+	struct mm_struct *mm = get_mm(test);
+	unsigned long region_size, large_size;
+	struct mermap_alloc *alloc;
+	int cpu;
+
+	migrate_disable();
+	cpu = raw_smp_processor_id();
+	region_size = mermap_cpu_end(cpu) - mermap_cpu_base(cpu) - PAGE_SIZE;
+	large_size = region_size - PAGE_SIZE;
+
+	/* Allocate whole region at once. */
+	full = __mermap_get_wrapper(test, mm, page, region_size, PAGE_KERNEL);
+	KUNIT_ASSERT_NOT_NULL(test, full->alloc);
+	__mermap_put_early(test, full);
+
+	/* Allocate larger than region size. */
+	fail = __mermap_get_wrapper(test, mm, page, region_size + PAGE_SIZE, PAGE_KERNEL);
+	KUNIT_ASSERT_NULL(test, fail->alloc);
+
+	/* Tiptoe up to the edge then past it. */
+	large = __mermap_get_wrapper(test, mm, page, large_size, PAGE_KERNEL);
+	KUNIT_ASSERT_NOT_NULL(test, large->alloc);
+	small = __mermap_get_wrapper(test, mm, page, PAGE_SIZE, PAGE_KERNEL);
+	KUNIT_ASSERT_NOT_NULL(test, small->alloc);
+	fail = __mermap_get_wrapper(test, mm, page, PAGE_SIZE, PAGE_KERNEL);
+	KUNIT_ASSERT_NULL(test, fail->alloc);
+
+	/* Can still allocate the reserved page. */
+	local_irq_disable();
+	alloc = __mermap_get(mm, page, PAGE_SIZE, PAGE_KERNEL, true);
+	local_irq_enable();
+	KUNIT_ASSERT_NOT_NULL(test, alloc);
+	__mermap_put(mm, alloc);
+	migrate_enable();
+}
+
+static void test_multiple_allocs(struct kunit *test)
+{
+	struct __mermap_put_args *argss[NR_NORMAL_ALLOCS] = { };
+	struct page *pages[NR_NORMAL_ALLOCS + 1];
+	struct mermap_alloc *reserved_alloc;
+	struct mm_struct *mm = get_mm(test);
+	int magic = 0xE4A4;
+
+	for (int i = 0; i < ARRAY_SIZE(pages); i++) {
+		pages[i] = alloc_page_wrapper(test, GFP_KERNEL);
+		WRITE_ONCE(*(int *)page_to_virt(pages[i]), magic + i);
+	}
+
+	for (int i = 0; i < ARRAY_SIZE(argss); i++) {
+		unsigned long base = mermap_cpu_base(raw_smp_processor_id());
+		unsigned long end = mermap_cpu_end(raw_smp_processor_id());
+		unsigned long addr;
+
+		argss[i] = __mermap_get_wrapper(test, mm, pages[i], PAGE_SIZE, PAGE_KERNEL);
+		KUNIT_ASSERT_NOT_NULL_MSG(test, argss[i], "alloc %d failed", i);
+
+		addr = (unsigned long) mermap_addr(argss[i]->alloc);
+		KUNIT_EXPECT_GE_MSG(test, addr, base, "alloc %d out of range", i);
+		KUNIT_EXPECT_LT_MSG(test, addr, end, "alloc %d out of range", i);
+	}
+
+	/*
+	 * Read through the mappings to try and detect if they point to the
+	 * pages we wrote earlier.
+	 */
+	kthread_use_mm(mm);
+	for (int i = 0; i < ARRAY_SIZE(pages) - 1; i++) {
+		int *ptr  = (int *)mermap_addr(argss[i]->alloc);
+
+		KUNIT_EXPECT_EQ(test, *ptr, magic + i);
+	}
+
+	/* Run out of alloc structures, only reserved allocs should succeed now. */
+	KUNIT_ASSERT_NULL(test, __mermap_get(mm, pages[NR_NORMAL_ALLOCS],
+					     PAGE_SIZE, PAGE_KERNEL, false));
+	preempt_disable();
+	reserved_alloc = __mermap_get(mm, pages[NR_NORMAL_ALLOCS],
+				      PAGE_SIZE, PAGE_KERNEL, true);
+	KUNIT_EXPECT_NOT_NULL(test, reserved_alloc);
+	/* Also check that this mapping seems correct. */
+	if (reserved_alloc) {
+		int *ptr = (int *)mermap_addr(reserved_alloc);
+
+		KUNIT_EXPECT_EQ(test, *ptr, magic + NR_NORMAL_ALLOCS);
+
+		mermap_put(reserved_alloc);
+	}
+	preempt_enable();
+
+	kthread_unuse_mm(mm);
+}
+
+static void test_tlb_flushed(struct kunit *test)
+{
+	struct page *page = alloc_page_wrapper(test, GFP_KERNEL);
+	struct mm_struct *mm = get_mm(test);
+	unsigned long addr, prev_addr = 0;
+	/* Avoid running forever in the failure case. */
+	unsigned long max_iters = 1000000;
+	struct mermap_cpu *mc;
+
+	migrate_disable();
+	mc = this_cpu_ptr(mm->mermap.cpu);
+
+	/*
+	 * Allocate until we see an address less than what we had before - assume
+	 * that means a reuse.
+	 */
+	for (int i = 0; i < max_iters; i++) {
+		struct mermap_alloc *alloc;
+
+		/*
+		 * Obviously flushing the TLB already is not wrong per se, but
+		 * it's unexpected and probably means there's some bug.
+		 * Use ASSERT to avoid spamming the log in the failure case.
+		 */
+		KUNIT_ASSERT_EQ_MSG(test, mc->tlb_flushes, 0,
+				    "unexpected flush before alloc %d", i);
+
+		alloc = __mermap_get(mm, page, PAGE_SIZE, PAGE_KERNEL, false);
+		KUNIT_ASSERT_NOT_NULL_MSG(test, alloc, "alloc %d failed", i);
+
+		addr = (unsigned long)mermap_addr(alloc);
+		__mermap_put(mm, alloc);
+		if (addr < prev_addr)
+			break;
+
+		prev_addr = addr;
+		cond_resched();
+	}
+	KUNIT_ASSERT_TRUE_MSG(test, addr < prev_addr, "no address reuse");
+	/* Again, more than one flush isn't wrong per se, but probably a bug. */
+	KUNIT_ASSERT_EQ(test, mc->tlb_flushes, 1);
+
+	migrate_enable();
+}
+
+static struct kunit_case mermap_test_cases[] = {
+	KUNIT_CASE(test_basic_alloc),
+	KUNIT_CASE(test_size),
+	KUNIT_CASE(test_multiple_allocs),
+	KUNIT_CASE(test_tlb_flushed),
+	{}
+};
+
+static struct kunit_suite mermap_test_suite = {
+	.name = "mermap",
+	.test_cases = mermap_test_cases,
+};
+kunit_test_suite(mermap_test_suite);
+
+MODULE_DESCRIPTION("Mermap unit tests");
+MODULE_LICENSE("GPL");
+MODULE_IMPORT_NS("EXPORTED_FOR_KUNIT_TESTING");

-- 
2.51.2



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 08/22] mm: introduce for_each_free_list()
  2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
                   ` (6 preceding siblings ...)
  2026-03-20 18:23 ` [PATCH v2 07/22] mm: KUnit tests for " Brendan Jackman
@ 2026-03-20 18:23 ` Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 09/22] mm/page_alloc: don't overload migratetype in find_suitable_fallback() Brendan Jackman
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

Later patches will rearrange the free areas, but there are a couple of
places that iterate over them with the assumption that they have the
current structure.

Ideally, code outside of mm would not be directly aware of
struct free_area in the first place, but that awareness seems relatively
harmless, so just make the minimal change here.

Instead of letting users manually iterate over the free lists, provide a
macro to do it, then adopt that macro in a couple of places.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 include/linux/mmzone.h  |  7 +++++--
 kernel/power/snapshot.c |  8 ++++----
 mm/mm_init.c            | 11 +++++++----
 3 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7bd0134c241ce..c49e3cdf4f6bb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -177,9 +177,12 @@ static inline bool migratetype_is_mergeable(int mt)
 	return mt < MIGRATE_PCPTYPES;
 }
 
-#define for_each_migratetype_order(order, type) \
+#define for_each_free_list(list, zone, order) \
 	for (order = 0; order < NR_PAGE_ORDERS; order++) \
-		for (type = 0; type < MIGRATE_TYPES; type++)
+		for (unsigned int type = 0; \
+		     list = &zone->free_area[order].free_list[type], \
+		     type < MIGRATE_TYPES; \
+		     type++) \
 
 extern int page_group_by_mobility_disabled;
 
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 7dcccf378cc2f..abd33ca13eec4 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1245,8 +1245,9 @@ unsigned int snapshot_additional_pages(struct zone *zone)
 static void mark_free_pages(struct zone *zone)
 {
 	unsigned long pfn, max_zone_pfn, page_count = WD_PAGE_COUNT;
+	struct list_head *free_list;
 	unsigned long flags;
-	unsigned int order, t;
+	unsigned int order;
 	struct page *page;
 
 	if (zone_is_empty(zone))
@@ -1270,9 +1271,8 @@ static void mark_free_pages(struct zone *zone)
 			swsusp_unset_page_free(page);
 	}
 
-	for_each_migratetype_order(order, t) {
-		list_for_each_entry(page,
-				&zone->free_area[order].free_list[t], buddy_list) {
+	for_each_free_list(free_list, zone, order) {
+		list_for_each_entry(page, free_list, buddy_list) {
 			unsigned long i;
 
 			pfn = page_to_pfn(page);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 969048f9b320c..f6f9455bc42b6 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1445,11 +1445,14 @@ static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx,
 
 static void __meminit zone_init_free_lists(struct zone *zone)
 {
-	unsigned int order, t;
-	for_each_migratetype_order(order, t) {
-		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
+	struct list_head *list;
+	unsigned int order;
+
+	for_each_free_list(list, zone, order)
+		INIT_LIST_HEAD(list);
+
+	for (order = 0; order < NR_PAGE_ORDERS; order++)
 		zone->free_area[order].nr_free = 0;
-	}
 
 #ifdef CONFIG_UNACCEPTED_MEMORY
 	INIT_LIST_HEAD(&zone->unaccepted_pages);

-- 
2.51.2



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 09/22] mm/page_alloc: don't overload migratetype in find_suitable_fallback()
  2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
                   ` (7 preceding siblings ...)
  2026-03-20 18:23 ` [PATCH v2 08/22] mm: introduce for_each_free_list() Brendan Jackman
@ 2026-03-20 18:23 ` Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 10/22] mm: introduce freetype_t Brendan Jackman
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

This function currently returns a signed integer that encodes status
in-band, as negative numbers, along with a migratetype.

This function is about to be updated to a mode where this in-band
signaling no longer makes sense. Therefore, switch to a more
explicit/verbose style that encodes the status and migratetype
separately.

In the spirit of making things more explicit, also create an enum to
avoid using magic integer literals with special meanings. This enables
documenting the values at their definition instead of in one of the
callers.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 mm/compaction.c |  3 ++-
 mm/internal.h   | 14 +++++++++++---
 mm/page_alloc.c | 40 +++++++++++++++++++++++-----------------
 3 files changed, 36 insertions(+), 21 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 32623894a6327..25371a75471dd 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2358,7 +2358,8 @@ static enum compact_result __compact_finished(struct compact_control *cc)
 		 * Job done if allocation would steal freepages from
 		 * other migratetype buddy lists.
 		 */
-		if (find_suitable_fallback(area, order, migratetype, true) >= 0)
+		if (find_suitable_fallback(area, order, migratetype, true, NULL)
+		    == FALLBACK_FOUND)
 			/*
 			 * Movable pages are OK in any pageblock. If we are
 			 * stealing for a non-movable allocation, make sure
diff --git a/mm/internal.h b/mm/internal.h
index f4c59534670e4..e3782721a588b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1059,9 +1059,17 @@ static inline void init_cma_pageblock(struct page *page)
 }
 #endif
 
-
-int find_suitable_fallback(struct free_area *area, unsigned int order,
-			   int migratetype, bool claimable);
+enum fallback_result {
+	/* Found suitable migratetype, *mt_out is valid. */
+	FALLBACK_FOUND,
+	/* No fallback found in requested order. */
+	FALLBACK_EMPTY,
+	/* @claimable was passed, but claiming a whole block is a bad idea. */
+	FALLBACK_NOCLAIM,
+};
+enum fallback_result
+find_suitable_fallback(struct free_area *area, unsigned int order,
+		       int migratetype, bool claimable, unsigned int *mt_out);
 
 static inline bool free_area_empty(struct free_area *area, int migratetype)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0a1dc7866068f..ac077d98019f3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2248,25 +2248,29 @@ static bool should_try_claim_block(unsigned int order, int start_mt)
  * we would do this whole-block claiming. This would help to reduce
  * fragmentation due to mixed migratetype pages in one pageblock.
  */
-int find_suitable_fallback(struct free_area *area, unsigned int order,
-			   int migratetype, bool claimable)
+enum fallback_result
+find_suitable_fallback(struct free_area *area, unsigned int order,
+		       int migratetype, bool claimable, unsigned int *mt_out)
 {
 	int i;
 
 	if (claimable && !should_try_claim_block(order, migratetype))
-		return -2;
+		return FALLBACK_NOCLAIM;
 
 	if (area->nr_free == 0)
-		return -1;
+		return FALLBACK_EMPTY;
 
 	for (i = 0; i < MIGRATE_PCPTYPES - 1 ; i++) {
 		int fallback_mt = fallbacks[migratetype][i];
 
-		if (!free_area_empty(area, fallback_mt))
-			return fallback_mt;
+		if (!free_area_empty(area, fallback_mt)) {
+			if (mt_out)
+				*mt_out = fallback_mt;
+			return FALLBACK_FOUND;
+		}
 	}
 
-	return -1;
+	return FALLBACK_EMPTY;
 }
 
 /*
@@ -2376,16 +2380,16 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
 	 */
 	for (current_order = MAX_PAGE_ORDER; current_order >= min_order;
 				--current_order) {
-		area = &(zone->free_area[current_order]);
-		fallback_mt = find_suitable_fallback(area, current_order,
-						     start_migratetype, true);
+		enum fallback_result result;
 
-		/* No block in that order */
-		if (fallback_mt == -1)
+		area = &(zone->free_area[current_order]);
+		result = find_suitable_fallback(area, current_order,
+						start_migratetype, true, &fallback_mt);
+
+		if (result == FALLBACK_EMPTY)
 			continue;
 
-		/* Advanced into orders too low to claim, abort */
-		if (fallback_mt == -2)
+		if (result == FALLBACK_NOCLAIM)
 			break;
 
 		page = get_page_from_free_area(area, fallback_mt);
@@ -2415,10 +2419,12 @@ __rmqueue_steal(struct zone *zone, int order, int start_migratetype)
 	int fallback_mt;
 
 	for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) {
+		enum fallback_result result;
+
 		area = &(zone->free_area[current_order]);
-		fallback_mt = find_suitable_fallback(area, current_order,
-						     start_migratetype, false);
-		if (fallback_mt == -1)
+		result = find_suitable_fallback(area, current_order, start_migratetype,
+						false, &fallback_mt);
+		if (result == FALLBACK_EMPTY)
 			continue;
 
 		page = get_page_from_free_area(area, fallback_mt);

-- 
2.51.2



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 10/22] mm: introduce freetype_t
  2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
                   ` (8 preceding siblings ...)
  2026-03-20 18:23 ` [PATCH v2 09/22] mm/page_alloc: don't overload migratetype in find_suitable_fallback() Brendan Jackman
@ 2026-03-20 18:23 ` Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 11/22] mm: move migratetype definitions to freetype.h Brendan Jackman
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

This is preparation for teaching the page allocator to break up free
pages according to properties that have nothing to do with mobility. For
example, it can be used to allocate pages that are not present in the
physmap, or pages that are treated as sensitive under ASI.

For these usecases, certain allocator behaviours are desirable:

- A "pool" of pages with the given property is usually available, so
  that pages can be provided with the correct sensitivity without
  zeroing/TLB flushing.

- Pages are physically grouped by the property, so that large
  allocations rarely have to alter the pagetables due to ASI.

- The properties can be forced to vary only at a certain fixed address
  granularity, so that the pagetables can all be pre-allocated. This is
  desirable because the page allocator will be changing mappings:
  pre-allocation is a straightforward way to avoid recursive allocations
  (of pagetables).

It seems that the existing infrastructure for grouping pages by
mobility, i.e. pageblocks and migratetypes, serves this purpose pretty
nicely. However, overloading migratetype itself for this purpose looks
like a road to maintenance hell. In particular, as soon as such
properties become orthogonal to migratetypes, it would start to require
"doubling" the migratetypes.

Therefore, introduce a new higher-level concept, called "freetype"
(because it is used to index "free"lists) that can encode extra
properties, orthogonally to mobility, via flags.

Since freetypes and migratetypes would be very easy to mix up, freetypes
are (at least for now) stored in a struct typedef similar to atomic_t.
This provides type-safety, but comes at the expense of being pretty
annoying to code with. For instance, freetype_t cannot be compared with
the == operator. Once this code matures, if the freetype/migratetype
distinction gets less confusing, it might be wise to drop this
struct and just use ints.

Because this will eventually be needed from pageblock-flags.h, put this
in its own header instead of directly in mmzone.h.

To try and reduce review pain for such a churny patch, first introduce
freetypes as nothing but an indirection over migratetypes. The helpers
concerned with the flags are defined, but only as stubs. Convert
everything over to using freetypes wherever they are needed to index
freelists, but maintain references to migratetypes in code that really
only cares specifically about mobility.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 include/linux/freetype.h |  38 +++++
 include/linux/gfp.h      |  16 +-
 include/linux/mmzone.h   |  49 +++++-
 mm/compaction.c          |  35 +++--
 mm/internal.h            |  17 ++-
 mm/page_alloc.c          | 390 +++++++++++++++++++++++++++++------------------
 mm/page_isolation.c      |   2 +-
 mm/page_owner.c          |   7 +-
 mm/page_reporting.c      |   4 +-
 mm/show_mem.c            |   4 +-
 10 files changed, 371 insertions(+), 191 deletions(-)

diff --git a/include/linux/freetype.h b/include/linux/freetype.h
new file mode 100644
index 0000000000000..9f857d10bb5db
--- /dev/null
+++ b/include/linux/freetype.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_FREETYPE_H
+#define _LINUX_FREETYPE_H
+
+#include <linux/types.h>
+
+/*
+ * A freetype is the index used to identify free lists. This consists of a
+ * migratetype, and other bits which encode orthogonal properties of memory.
+ */
+typedef struct {
+	int migratetype;
+} freetype_t;
+
+/*
+ * Return a dense linear index for freetypes that have lists in the free area.
+ * Return -1 for other freetypes.
+ */
+static inline int freetype_idx(freetype_t freetype)
+{
+	return freetype.migratetype;
+}
+
+/* No freetype flags actually exist yet. */
+#define NR_FREETYPE_IDXS MIGRATE_TYPES
+
+static inline unsigned int freetype_flags(freetype_t freetype)
+{
+	/* No flags supported yet. */
+	return 0;
+}
+
+static inline bool freetypes_equal(freetype_t a, freetype_t b)
+{
+	return a.migratetype == b.migratetype;
+}
+
+#endif /* _LINUX_FREETYPE_H */
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 51ef13ed756eb..34a38c420e84a 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -21,8 +21,10 @@ struct mempolicy;
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
 #define GFP_MOVABLE_SHIFT 3
 
-static inline int gfp_migratetype(const gfp_t gfp_flags)
+static inline freetype_t gfp_freetype(const gfp_t gfp_flags)
 {
+	int migratetype;
+
 	VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
 	BUILD_BUG_ON((1UL << GFP_MOVABLE_SHIFT) != ___GFP_MOVABLE);
 	BUILD_BUG_ON((___GFP_MOVABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_MOVABLE);
@@ -30,11 +32,15 @@ static inline int gfp_migratetype(const gfp_t gfp_flags)
 	BUILD_BUG_ON(((___GFP_MOVABLE | ___GFP_RECLAIMABLE) >>
 		      GFP_MOVABLE_SHIFT) != MIGRATE_HIGHATOMIC);
 
-	if (unlikely(page_group_by_mobility_disabled))
-		return MIGRATE_UNMOVABLE;
+	if (unlikely(page_group_by_mobility_disabled)) {
+		migratetype = MIGRATE_UNMOVABLE;
+	} else {
+		/* Group based on mobility */
+		migratetype = (__force unsigned long)(gfp_flags & GFP_MOVABLE_MASK)
+			>> GFP_MOVABLE_SHIFT;
+	}
 
-	/* Group based on mobility */
-	return (__force unsigned long)(gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
+	return migrate_to_freetype(migratetype, 0);
 }
 #undef GFP_MOVABLE_MASK
 #undef GFP_MOVABLE_SHIFT
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c49e3cdf4f6bb..c456ddd1f5979 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -5,6 +5,7 @@
 #ifndef __ASSEMBLY__
 #ifndef __GENERATING_BOUNDS_H
 
+#include <linux/freetype.h>
 #include <linux/spinlock.h>
 #include <linux/list.h>
 #include <linux/list_nulls.h>
@@ -179,24 +180,62 @@ static inline bool migratetype_is_mergeable(int mt)
 
 #define for_each_free_list(list, zone, order) \
 	for (order = 0; order < NR_PAGE_ORDERS; order++) \
-		for (unsigned int type = 0; \
-		     list = &zone->free_area[order].free_list[type], \
-		     type < MIGRATE_TYPES; \
-		     type++) \
+		for (unsigned int idx = 0; \
+		     list = &zone->free_area[order].free_list[idx], \
+		     idx < NR_FREETYPE_IDXS; \
+		     idx++)
+
+static inline freetype_t migrate_to_freetype(enum migratetype mt,
+					     unsigned int flags)
+{
+	freetype_t freetype;
+
+	/* No flags supported yet. */
+	VM_WARN_ON_ONCE(flags);
+
+	freetype.migratetype = mt;
+	return freetype;
+}
+
+static inline enum migratetype free_to_migratetype(freetype_t freetype)
+{
+	return freetype.migratetype;
+}
+
+/* Convenience helper, return the freetype modified to have the migratetype. */
+static inline freetype_t freetype_with_migrate(freetype_t freetype,
+					       enum migratetype migratetype)
+{
+	return migrate_to_freetype(migratetype, freetype_flags(freetype));
+}
 
 extern int page_group_by_mobility_disabled;
 
+freetype_t get_pfnblock_freetype(const struct page *page, unsigned long pfn);
+
 #define get_pageblock_migratetype(page) \
 	get_pfnblock_migratetype(page, page_to_pfn(page))
 
+#define get_pageblock_freetype(page) \
+	get_pfnblock_freetype(page, page_to_pfn(page))
+
 #define folio_migratetype(folio) \
 	get_pageblock_migratetype(&folio->page)
 
 struct free_area {
-	struct list_head	free_list[MIGRATE_TYPES];
+	struct list_head	free_list[NR_FREETYPE_IDXS];
 	unsigned long		nr_free;
 };
 
+static inline
+struct list_head *free_area_list(struct free_area *area, freetype_t type)
+{
+	int idx = freetype_idx(type);
+
+	VM_BUG_ON(idx < 0);
+	return &area->free_list[idx];
+}
+
 struct pglist_data;
 
 #ifdef CONFIG_NUMA
diff --git a/mm/compaction.c b/mm/compaction.c
index 25371a75471dd..5630514191936 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1394,7 +1394,7 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
 static bool suitable_migration_source(struct compact_control *cc,
 							struct page *page)
 {
-	int block_mt;
+	freetype_t block_ft;
 
 	if (pageblock_skip_persistent(page))
 		return false;
@@ -1402,12 +1402,12 @@ static bool suitable_migration_source(struct compact_control *cc,
 	if ((cc->mode != MIGRATE_ASYNC) || !cc->direct_compaction)
 		return true;
 
-	block_mt = get_pageblock_migratetype(page);
+	block_ft = get_pageblock_freetype(page);
 
-	if (cc->migratetype == MIGRATE_MOVABLE)
-		return is_migrate_movable(block_mt);
+	if (free_to_migratetype(cc->freetype) == MIGRATE_MOVABLE)
+		return is_migrate_movable(free_to_migratetype(block_ft));
 	else
-		return block_mt == cc->migratetype;
+		return freetypes_equal(block_ft, cc->freetype);
 }
 
 /* Returns true if the page is within a block suitable for migration to */
@@ -1998,7 +1998,8 @@ static unsigned long fast_find_migrateblock(struct compact_control *cc)
 	 * reduces the risk that a large movable pageblock is freed for
 	 * an unmovable/reclaimable small allocation.
 	 */
-	if (cc->direct_compaction && cc->migratetype != MIGRATE_MOVABLE)
+	if (cc->direct_compaction &&
+	    free_to_migratetype(cc->freetype) != MIGRATE_MOVABLE)
 		return pfn;
 
 	/*
@@ -2269,7 +2270,7 @@ static bool should_proactive_compact_node(pg_data_t *pgdat)
 static enum compact_result __compact_finished(struct compact_control *cc)
 {
 	unsigned int order;
-	const int migratetype = cc->migratetype;
+	const freetype_t freetype = cc->freetype;
 	int ret;
 
 	/* Compaction run completes if the migrate and free scanner meet */
@@ -2344,25 +2345,27 @@ static enum compact_result __compact_finished(struct compact_control *cc)
 	for (order = cc->order; order < NR_PAGE_ORDERS; order++) {
 		struct free_area *area = &cc->zone->free_area[order];
 
-		/* Job done if page is free of the right migratetype */
-		if (!free_area_empty(area, migratetype))
+		/* Job done if page is free of the right freetype */
+		if (!free_area_empty(area, freetype))
 			return COMPACT_SUCCESS;
 
 #ifdef CONFIG_CMA
 		/* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
-		if (migratetype == MIGRATE_MOVABLE &&
-			!free_area_empty(area, MIGRATE_CMA))
+		if (free_to_migratetype(freetype) == MIGRATE_MOVABLE &&
+		    !free_area_empty(area, freetype_with_migrate(cc->freetype,
+								 MIGRATE_CMA)))
 			return COMPACT_SUCCESS;
 #endif
 		/*
 		 * Job done if allocation would steal freepages from
-		 * other migratetype buddy lists.
+		 * other freetype buddy lists.
 		 */
-		if (find_suitable_fallback(area, order, migratetype, true, NULL)
+		if (find_suitable_fallback(area, order, freetype, true, NULL)
 		    == FALLBACK_FOUND)
 			/*
-			 * Movable pages are OK in any pageblock. If we are
-			 * stealing for a non-movable allocation, make sure
+			 * Movable pages are OK in any pageblock of the right
+			 * sensitivity. If we are stealing for a
+			 * non-movable allocation, make sure
 			 * we finish compacting the current pageblock first
 			 * (which is assured by the above migrate_pfn align
 			 * check) so it is as free as possible and we won't
@@ -2567,7 +2570,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 		INIT_LIST_HEAD(&cc->freepages[order]);
 	INIT_LIST_HEAD(&cc->migratepages);
 
-	cc->migratetype = gfp_migratetype(cc->gfp_mask);
+	cc->freetype = gfp_freetype(cc->gfp_mask);
 
 	if (!is_via_compact_memory(cc->order)) {
 		ret = compaction_suit_allocation_order(cc->zone, cc->order,
diff --git a/mm/internal.h b/mm/internal.h
index e3782721a588b..d929274d73b92 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -10,6 +10,7 @@
 #include <linux/fs.h>
 #include <linux/khugepaged.h>
 #include <linux/mm.h>
+#include <linux/mmzone.h>
 #include <linux/mm_inline.h>
 #include <linux/mmu_notifier.h>
 #include <linux/pagemap.h>
@@ -681,7 +682,7 @@ struct alloc_context {
 	struct zonelist *zonelist;
 	nodemask_t *nodemask;
 	struct zoneref *preferred_zoneref;
-	int migratetype;
+	freetype_t freetype;
 
 	/*
 	 * highest_zoneidx represents highest usable zone index of
@@ -832,8 +833,8 @@ static inline void clear_zone_contiguous(struct zone *zone)
 }
 
 extern int __isolate_free_page(struct page *page, unsigned int order);
-extern void __putback_isolated_page(struct page *page, unsigned int order,
-				    int mt);
+void __putback_isolated_page(struct page *page, unsigned int order,
+			     freetype_t freetype);
 extern void memblock_free_pages(unsigned long pfn, unsigned int order);
 extern void __free_pages_core(struct page *page, unsigned int order,
 		enum meminit_context context);
@@ -999,7 +1000,7 @@ struct compact_control {
 	short search_order;		/* order to start a fast search at */
 	const gfp_t gfp_mask;		/* gfp mask of a direct compactor */
 	int order;			/* order a direct compactor needs */
-	int migratetype;		/* migratetype of direct compactor */
+	freetype_t freetype;		/* freetype of direct compactor */
 	const unsigned int alloc_flags;	/* alloc flags of a direct compactor */
 	const int highest_zoneidx;	/* zone index of a direct compactor */
 	enum migrate_mode mode;		/* Async or sync migration mode */
@@ -1060,7 +1061,7 @@ static inline void init_cma_pageblock(struct page *page)
 #endif
 
 enum fallback_result {
-	/* Found suitable migratetype, *mt_out is valid. */
+	/* Found suitable fallback, *ft_out is valid. */
 	FALLBACK_FOUND,
 	/* No fallback found in requested order. */
 	FALLBACK_EMPTY,
@@ -1069,11 +1070,11 @@ enum fallback_result {
 };
 enum fallback_result
 find_suitable_fallback(struct free_area *area, unsigned int order,
-		       int migratetype, bool claimable, unsigned int *mt_out);
+		       freetype_t freetype, bool claimable, freetype_t *ft_out);
 
-static inline bool free_area_empty(struct free_area *area, int migratetype)
+static inline bool free_area_empty(struct free_area *area, freetype_t freetype)
 {
-	return list_empty(&area->free_list[migratetype]);
+	return list_empty(free_area_list(area, freetype));
 }
 
 /* mm/util.c */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ac077d98019f3..018622aa19006 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -422,6 +422,37 @@ bool get_pfnblock_bit(const struct page *page, unsigned long pfn,
 	return test_bit(bitidx + pb_bit, bitmap_word);
 }
 
+/**
+ * __get_pfnblock_freetype - Return the freetype of a pageblock, optionally
+ * ignoring the fact that it's currently isolated.
+ * @page: The page within the block of interest
+ * @pfn: The target page frame number
+ * @ignore_iso: If isolated, return the freetype that the block had before
+ *              isolation.
+ */
+__always_inline freetype_t
+__get_pfnblock_freetype(const struct page *page, unsigned long pfn,
+			bool ignore_iso)
+{
+	int mt = get_pfnblock_migratetype(page, pfn);
+
+	return migrate_to_freetype(mt, 0);
+}
+
+/**
+ * get_pfnblock_freetype - Return the freetype of a pageblock
+ * @page: The page within the block of interest
+ * @pfn: The target page frame number
+ *
+ * Return: The freetype of the pageblock
+ */
+__always_inline freetype_t
+get_pfnblock_freetype(const struct page *page, unsigned long pfn)
+{
+	return __get_pfnblock_freetype(page, pfn, false);
+}
+
+
 /**
  * get_pfnblock_migratetype - Return the migratetype of a pageblock
  * @page: The page within the block of interest
@@ -733,8 +764,11 @@ static inline struct capture_control *task_capc(struct zone *zone)
 
 static inline bool
 compaction_capture(struct capture_control *capc, struct page *page,
-		   int order, int migratetype)
+		   int order, freetype_t freetype)
 {
+	enum migratetype migratetype = free_to_migratetype(freetype);
+	enum migratetype capc_mt;
+
 	if (!capc || order != capc->cc->order)
 		return false;
 
@@ -743,6 +777,8 @@ compaction_capture(struct capture_control *capc, struct page *page,
 	    is_migrate_isolate(migratetype))
 		return false;
 
+	capc_mt = free_to_migratetype(capc->cc->freetype);
+
 	/*
 	 * Do not let lower order allocations pollute a movable pageblock
 	 * unless compaction is also requesting movable pages.
@@ -751,12 +787,12 @@ compaction_capture(struct capture_control *capc, struct page *page,
 	 * have trouble finding a high-order free page.
 	 */
 	if (order < pageblock_order && migratetype == MIGRATE_MOVABLE &&
-	    capc->cc->migratetype != MIGRATE_MOVABLE)
+	    capc_mt != MIGRATE_MOVABLE)
 		return false;
 
-	if (migratetype != capc->cc->migratetype)
+	if (migratetype != capc_mt)
 		trace_mm_page_alloc_extfrag(page, capc->cc->order, order,
-					    capc->cc->migratetype, migratetype);
+					    capc_mt, migratetype);
 
 	capc->page = page;
 	return true;
@@ -770,7 +806,7 @@ static inline struct capture_control *task_capc(struct zone *zone)
 
 static inline bool
 compaction_capture(struct capture_control *capc, struct page *page,
-		   int order, int migratetype)
+		   int order, freetype_t freetype)
 {
 	return false;
 }
@@ -795,23 +831,28 @@ static inline void account_freepages(struct zone *zone, int nr_pages,
 
 /* Used for pages not on another list */
 static inline void __add_to_free_list(struct page *page, struct zone *zone,
-				      unsigned int order, int migratetype,
+				      unsigned int order, freetype_t freetype,
 				      bool tail)
 {
 	struct free_area *area = &zone->free_area[order];
 	int nr_pages = 1 << order;
 
-	VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
-		     "page type is %d, passed migratetype is %d (nr=%d)\n",
-		     get_pageblock_migratetype(page), migratetype, nr_pages);
+	if (IS_ENABLED(CONFIG_DEBUG_VM)) {
+		freetype_t block_ft = get_pageblock_freetype(page);
+
+		VM_WARN_ONCE(!freetypes_equal(block_ft, freetype),
+				"page type is %d/%#x, passed type is %d/%#x (nr=%d)\n",
+				block_ft.migratetype, freetype_flags(block_ft),
+				freetype.migratetype, freetype_flags(freetype), nr_pages);
+	}
 
 	if (tail)
-		list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
+		list_add_tail(&page->buddy_list, free_area_list(area, freetype));
 	else
-		list_add(&page->buddy_list, &area->free_list[migratetype]);
+		list_add(&page->buddy_list, free_area_list(area, freetype));
 	area->nr_free++;
 
-	if (order >= pageblock_order && !is_migrate_isolate(migratetype))
+	if (order >= pageblock_order && !is_migrate_isolate(free_to_migratetype(freetype)))
 		__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, nr_pages);
 }
 
@@ -821,17 +862,25 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
  * allocation again (e.g., optimization for memory onlining).
  */
 static inline void move_to_free_list(struct page *page, struct zone *zone,
-				     unsigned int order, int old_mt, int new_mt)
+				     unsigned int order,
+				     freetype_t old_ft, freetype_t new_ft)
 {
 	struct free_area *area = &zone->free_area[order];
+	int old_mt = free_to_migratetype(old_ft);
+	int new_mt = free_to_migratetype(new_ft);
 	int nr_pages = 1 << order;
 
 	/* Free page moving can fail, so it happens before the type update */
-	VM_WARN_ONCE(get_pageblock_migratetype(page) != old_mt,
-		     "page type is %d, passed migratetype is %d (nr=%d)\n",
-		     get_pageblock_migratetype(page), old_mt, nr_pages);
+	if (IS_ENABLED(CONFIG_DEBUG_VM)) {
+		freetype_t block_ft = get_pageblock_freetype(page);
 
-	list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
+		VM_WARN_ONCE(!freetypes_equal(block_ft, old_ft),
+				"page type is %d/%#x, passed freetype is %d/%#x (nr=%d)\n",
+				block_ft.migratetype, freetype_flags(block_ft),
+				old_ft.migratetype, freetype_flags(old_ft), nr_pages);
+	}
+
+	list_move_tail(&page->buddy_list, free_area_list(area, new_ft));
 
 	account_freepages(zone, -nr_pages, old_mt);
 	account_freepages(zone, nr_pages, new_mt);
@@ -874,9 +923,9 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
 }
 
 static inline struct page *get_page_from_free_area(struct free_area *area,
-					    int migratetype)
+						   freetype_t freetype)
 {
-	return list_first_entry_or_null(&area->free_list[migratetype],
+	return list_first_entry_or_null(free_area_list(area, freetype),
 					struct page, buddy_list);
 }
 
@@ -943,9 +992,10 @@ static void change_pageblock_range(struct page *pageblock_page,
 static inline void __free_one_page(struct page *page,
 		unsigned long pfn,
 		struct zone *zone, unsigned int order,
-		int migratetype, fpi_t fpi_flags)
+		freetype_t freetype, fpi_t fpi_flags)
 {
 	struct capture_control *capc = task_capc(zone);
+	int migratetype = free_to_migratetype(freetype);
 	unsigned long buddy_pfn = 0;
 	unsigned long combined_pfn;
 	struct page *buddy;
@@ -954,16 +1004,17 @@ static inline void __free_one_page(struct page *page,
 	VM_BUG_ON(!zone_is_initialized(zone));
 	VM_BUG_ON_PAGE(page->flags.f & PAGE_FLAGS_CHECK_AT_PREP, page);
 
-	VM_BUG_ON(migratetype == -1);
+	VM_BUG_ON(freetype.migratetype == -1);
 	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
 	VM_BUG_ON_PAGE(bad_range(zone, page), page);
 
 	account_freepages(zone, 1 << order, migratetype);
 
 	while (order < MAX_PAGE_ORDER) {
-		int buddy_mt = migratetype;
+		freetype_t buddy_ft = freetype;
+		enum migratetype buddy_mt = free_to_migratetype(buddy_ft);
 
-		if (compaction_capture(capc, page, order, migratetype)) {
+		if (compaction_capture(capc, page, order, freetype)) {
 			account_freepages(zone, -(1 << order), migratetype);
 			return;
 		}
@@ -979,7 +1030,8 @@ static inline void __free_one_page(struct page *page,
 			 * pageblock isolation could cause incorrect freepage or CMA
 			 * accounting or HIGHATOMIC accounting.
 			 */
-			buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
+			buddy_ft = get_pfnblock_freetype(buddy, buddy_pfn);
+			buddy_mt = free_to_migratetype(buddy_ft);
 
 			if (migratetype != buddy_mt &&
 			    (!migratetype_is_mergeable(migratetype) ||
@@ -1021,7 +1073,7 @@ static inline void __free_one_page(struct page *page,
 	else
 		to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
 
-	__add_to_free_list(page, zone, order, migratetype, to_tail);
+	__add_to_free_list(page, zone, order, freetype, to_tail);
 
 	/* Notify page reporting subsystem of freed page */
 	if (!(fpi_flags & FPI_SKIP_REPORT_NOTIFY))
@@ -1485,19 +1537,20 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		nr_pages = 1 << order;
 		do {
 			unsigned long pfn;
-			int mt;
+			freetype_t ft;
 
 			page = list_last_entry(list, struct page, pcp_list);
 			pfn = page_to_pfn(page);
-			mt = get_pfnblock_migratetype(page, pfn);
+			ft = get_pfnblock_freetype(page, pfn);
 
 			/* must delete to avoid corrupting pcp list */
 			list_del(&page->pcp_list);
 			count -= nr_pages;
 			pcp->count -= nr_pages;
 
-			__free_one_page(page, pfn, zone, order, mt, FPI_NONE);
-			trace_mm_page_pcpu_drain(page, order, mt);
+			__free_one_page(page, pfn, zone, order, ft, FPI_NONE);
+			trace_mm_page_pcpu_drain(page, order,
+						 free_to_migratetype(ft));
 		} while (count > 0 && !list_empty(list));
 	}
 
@@ -1518,9 +1571,9 @@ static void split_large_buddy(struct zone *zone, struct page *page,
 		order = pageblock_order;
 
 	do {
-		int mt = get_pfnblock_migratetype(page, pfn);
+		freetype_t ft = get_pfnblock_freetype(page, pfn);
 
-		__free_one_page(page, pfn, zone, order, mt, fpi);
+		__free_one_page(page, pfn, zone, order, ft, fpi);
 		pfn += 1 << order;
 		if (pfn == end)
 			break;
@@ -1698,7 +1751,7 @@ struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
  * -- nyc
  */
 static inline unsigned int expand(struct zone *zone, struct page *page, int low,
-				  int high, int migratetype)
+				  int high, freetype_t freetype)
 {
 	unsigned int size = 1 << high;
 	unsigned int nr_added = 0;
@@ -1717,7 +1770,7 @@ static inline unsigned int expand(struct zone *zone, struct page *page, int low,
 		if (set_page_guard(zone, &page[size], high))
 			continue;
 
-		__add_to_free_list(&page[size], zone, high, migratetype, false);
+		__add_to_free_list(&page[size], zone, high, freetype, false);
 		set_buddy_order(&page[size], high);
 		nr_added += size;
 	}
@@ -1727,12 +1780,13 @@ static inline unsigned int expand(struct zone *zone, struct page *page, int low,
 
 static __always_inline void page_del_and_expand(struct zone *zone,
 						struct page *page, int low,
-						int high, int migratetype)
+						int high, freetype_t freetype)
 {
+	enum migratetype migratetype = free_to_migratetype(freetype);
 	int nr_pages = 1 << high;
 
 	__del_page_from_free_list(page, zone, high, migratetype);
-	nr_pages -= expand(zone, page, low, high, migratetype);
+	nr_pages -= expand(zone, page, low, high, freetype);
 	account_freepages(zone, -nr_pages, migratetype);
 }
 
@@ -1885,7 +1939,7 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags
  */
 static __always_inline
 struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
-						int migratetype)
+						freetype_t freetype)
 {
 	unsigned int current_order;
 	struct free_area *area;
@@ -1893,13 +1947,15 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 
 	/* Find a page of the appropriate size in the preferred list */
 	for (current_order = order; current_order < NR_PAGE_ORDERS; ++current_order) {
+		enum migratetype migratetype = free_to_migratetype(freetype);
+
 		area = &(zone->free_area[current_order]);
-		page = get_page_from_free_area(area, migratetype);
+		page = get_page_from_free_area(area, freetype);
 		if (!page)
 			continue;
 
 		page_del_and_expand(zone, page, order, current_order,
-				    migratetype);
+				    freetype);
 		trace_mm_page_alloc_zone_locked(page, order, migratetype,
 				pcp_allowed_order(order) &&
 				migratetype < MIGRATE_PCPTYPES);
@@ -1924,13 +1980,18 @@ static int fallbacks[MIGRATE_PCPTYPES][MIGRATE_PCPTYPES - 1] = {
 
 #ifdef CONFIG_CMA
 static __always_inline struct page *__rmqueue_cma_fallback(struct zone *zone,
-					unsigned int order)
+					unsigned int order, unsigned int ft_flags)
 {
-	return __rmqueue_smallest(zone, order, MIGRATE_CMA);
+	freetype_t freetype = migrate_to_freetype(MIGRATE_CMA, ft_flags);
+
+	return __rmqueue_smallest(zone, order, freetype);
 }
 #else
 static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
-					unsigned int order) { return NULL; }
+					unsigned int order, unsigned int ft_flags)
+{
+	return NULL;
+}
 #endif
 
 /*
@@ -1938,7 +1999,7 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
  * change the block type.
  */
 static int __move_freepages_block(struct zone *zone, unsigned long start_pfn,
-				  int old_mt, int new_mt)
+				  freetype_t old_ft, freetype_t new_ft)
 {
 	struct page *page;
 	unsigned long pfn, end_pfn;
@@ -1961,7 +2022,7 @@ static int __move_freepages_block(struct zone *zone, unsigned long start_pfn,
 
 		order = buddy_order(page);
 
-		move_to_free_list(page, zone, order, old_mt, new_mt);
+		move_to_free_list(page, zone, order, old_ft, new_ft);
 
 		pfn += 1 << order;
 		pages_moved += 1 << order;
@@ -2021,7 +2082,7 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
 }
 
 static int move_freepages_block(struct zone *zone, struct page *page,
-				int old_mt, int new_mt)
+				freetype_t old_ft, freetype_t new_ft)
 {
 	unsigned long start_pfn;
 	int res;
@@ -2029,8 +2090,11 @@ static int move_freepages_block(struct zone *zone, struct page *page,
 	if (!prep_move_freepages_block(zone, page, &start_pfn, NULL, NULL))
 		return -1;
 
-	res = __move_freepages_block(zone, start_pfn, old_mt, new_mt);
-	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
+	VM_BUG_ON(freetype_flags(old_ft) != freetype_flags(new_ft));
+
+	res = __move_freepages_block(zone, start_pfn, old_ft, new_ft);
+	set_pageblock_migratetype(pfn_to_page(start_pfn),
+				  free_to_migratetype(new_ft));
 
 	return res;
 
@@ -2098,8 +2162,7 @@ static bool __move_freepages_block_isolate(struct zone *zone,
 		struct page *page, bool isolate)
 {
 	unsigned long start_pfn, buddy_pfn;
-	int from_mt;
-	int to_mt;
+	freetype_t from_ft, to_ft;
 	struct page *buddy;
 
 	if (isolate == get_pageblock_isolate(page)) {
@@ -2129,18 +2192,15 @@ static bool __move_freepages_block_isolate(struct zone *zone,
 	}
 
 move:
-	/* Use MIGRATETYPE_MASK to get non-isolate migratetype */
 	if (isolate) {
-		from_mt = __get_pfnblock_flags_mask(page, page_to_pfn(page),
-						    MIGRATETYPE_MASK);
-		to_mt = MIGRATE_ISOLATE;
+		from_ft = __get_pfnblock_freetype(page, page_to_pfn(page), true);
+		to_ft = freetype_with_migrate(from_ft, MIGRATE_ISOLATE);
 	} else {
-		from_mt = MIGRATE_ISOLATE;
-		to_mt = __get_pfnblock_flags_mask(page, page_to_pfn(page),
-						  MIGRATETYPE_MASK);
+		to_ft = __get_pfnblock_freetype(page, page_to_pfn(page), true);
+		from_ft = freetype_with_migrate(to_ft, MIGRATE_ISOLATE);
 	}
 
-	__move_freepages_block(zone, start_pfn, from_mt, to_mt);
+	__move_freepages_block(zone, start_pfn, from_ft, to_ft);
 	toggle_pageblock_isolate(pfn_to_page(start_pfn), isolate);
 
 	return true;
@@ -2244,15 +2304,16 @@ static bool should_try_claim_block(unsigned int order, int start_mt)
 
 /*
  * Check whether there is a suitable fallback freepage with requested order.
- * If claimable is true, this function returns fallback_mt only if
+ * If claimable is true, this function returns a fallback only if
  * we would do this whole-block claiming. This would help to reduce
  * fragmentation due to mixed migratetype pages in one pageblock.
  */
 enum fallback_result
 find_suitable_fallback(struct free_area *area, unsigned int order,
-		       int migratetype, bool claimable, unsigned int *mt_out)
+		       freetype_t freetype, bool claimable, freetype_t *ft_out)
 {
 	int i;
+	enum migratetype migratetype = free_to_migratetype(freetype);
 
 	if (claimable && !should_try_claim_block(order, migratetype))
 		return FALLBACK_NOCLAIM;
@@ -2262,10 +2323,18 @@ find_suitable_fallback(struct free_area *area, unsigned int order,
 
 	for (i = 0; i < MIGRATE_PCPTYPES - 1 ; i++) {
 		int fallback_mt = fallbacks[migratetype][i];
+		/*
+		 * Fallback to different migratetypes, but currently always with
+		 * the same freetype flags.
+		 */
+		freetype_t fallback_ft = freetype_with_migrate(freetype, fallback_mt);
 
-		if (!free_area_empty(area, fallback_mt)) {
-			if (mt_out)
-				*mt_out = fallback_mt;
+		if (freetype_idx(fallback_ft) < 0)
+			continue;
+
+		if (!free_area_empty(area, fallback_ft)) {
+			if (ft_out)
+				*ft_out = fallback_ft;
 			return FALLBACK_FOUND;
 		}
 	}
@@ -2282,20 +2351,22 @@ find_suitable_fallback(struct free_area *area, unsigned int order,
  */
 static struct page *
 try_to_claim_block(struct zone *zone, struct page *page,
-		   int current_order, int order, int start_type,
-		   int block_type, unsigned int alloc_flags)
+		   int current_order, int order, freetype_t start_type,
+		   freetype_t block_type, unsigned int alloc_flags)
 {
 	int free_pages, movable_pages, alike_pages;
+	int block_mt = free_to_migratetype(block_type);
+	int start_mt = free_to_migratetype(start_type);
 	unsigned long start_pfn;
 
 	/* Take ownership for orders >= pageblock_order */
 	if (current_order >= pageblock_order) {
 		unsigned int nr_added;
 
-		del_page_from_free_list(page, zone, current_order, block_type);
-		change_pageblock_range(page, current_order, start_type);
+		del_page_from_free_list(page, zone, current_order, block_mt);
+		change_pageblock_range(page, current_order, start_mt);
 		nr_added = expand(zone, page, order, current_order, start_type);
-		account_freepages(zone, nr_added, start_type);
+		account_freepages(zone, nr_added, start_mt);
 		return page;
 	}
 
@@ -2317,7 +2388,7 @@ try_to_claim_block(struct zone *zone, struct page *page,
 	 * For movable allocation, it's the number of movable pages which
 	 * we just obtained. For other types it's a bit more tricky.
 	 */
-	if (start_type == MIGRATE_MOVABLE) {
+	if (start_mt == MIGRATE_MOVABLE) {
 		alike_pages = movable_pages;
 	} else {
 		/*
@@ -2327,7 +2398,7 @@ try_to_claim_block(struct zone *zone, struct page *page,
 		 * vice versa, be conservative since we can't distinguish the
 		 * exact migratetype of non-movable pages.
 		 */
-		if (block_type == MIGRATE_MOVABLE)
+		if (block_mt == MIGRATE_MOVABLE)
 			alike_pages = pageblock_nr_pages
 						- (free_pages + movable_pages);
 		else
@@ -2340,7 +2411,7 @@ try_to_claim_block(struct zone *zone, struct page *page,
 	if (free_pages + alike_pages >= (1 << (pageblock_order-1)) ||
 			page_group_by_mobility_disabled) {
 		__move_freepages_block(zone, start_pfn, block_type, start_type);
-		set_pageblock_migratetype(pfn_to_page(start_pfn), start_type);
+		set_pageblock_migratetype(pfn_to_page(start_pfn), start_mt);
 		return __rmqueue_smallest(zone, order, start_type);
 	}
 
@@ -2356,14 +2427,13 @@ try_to_claim_block(struct zone *zone, struct page *page,
  * condition simpler.
  */
 static __always_inline struct page *
-__rmqueue_claim(struct zone *zone, int order, int start_migratetype,
+__rmqueue_claim(struct zone *zone, int order, freetype_t start_freetype,
 						unsigned int alloc_flags)
 {
 	struct free_area *area;
 	int current_order;
 	int min_order = order;
 	struct page *page;
-	int fallback_mt;
 
 	/*
 	 * Do not steal pages from freelists belonging to other pageblocks
@@ -2380,11 +2450,13 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
 	 */
 	for (current_order = MAX_PAGE_ORDER; current_order >= min_order;
 				--current_order) {
+		int start_mt = free_to_migratetype(start_freetype);
 		enum fallback_result result;
+		freetype_t fallback_ft;
 
 		area = &(zone->free_area[current_order]);
-		result = find_suitable_fallback(area, current_order,
-						start_migratetype, true, &fallback_mt);
+		result = find_suitable_fallback(area, current_order, start_freetype,
+						true, &fallback_ft);
 
 		if (result == FALLBACK_EMPTY)
 			continue;
@@ -2392,13 +2464,13 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
 		if (result == FALLBACK_NOCLAIM)
 			break;
 
-		page = get_page_from_free_area(area, fallback_mt);
+		page = get_page_from_free_area(area, fallback_ft);
 		page = try_to_claim_block(zone, page, current_order, order,
-					  start_migratetype, fallback_mt,
+					  start_freetype, fallback_ft,
 					  alloc_flags);
 		if (page) {
 			trace_mm_page_alloc_extfrag(page, order, current_order,
-						    start_migratetype, fallback_mt);
+				start_mt, free_to_migratetype(fallback_ft));
 			return page;
 		}
 	}
@@ -2411,26 +2483,27 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
  * the block as its current migratetype, potentially causing fragmentation.
  */
 static __always_inline struct page *
-__rmqueue_steal(struct zone *zone, int order, int start_migratetype)
+__rmqueue_steal(struct zone *zone, int order, freetype_t start_freetype)
 {
 	struct free_area *area;
 	int current_order;
 	struct page *page;
-	int fallback_mt;
 
 	for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) {
 		enum fallback_result result;
+		freetype_t fallback_ft;
 
 		area = &(zone->free_area[current_order]);
-		result = find_suitable_fallback(area, current_order, start_migratetype,
-						false, &fallback_mt);
+		result = find_suitable_fallback(area, current_order, start_freetype,
+						false, &fallback_ft);
 		if (result == FALLBACK_EMPTY)
 			continue;
 
-		page = get_page_from_free_area(area, fallback_mt);
-		page_del_and_expand(zone, page, order, current_order, fallback_mt);
+		page = get_page_from_free_area(area, fallback_ft);
+		page_del_and_expand(zone, page, order, current_order, fallback_ft);
 		trace_mm_page_alloc_extfrag(page, order, current_order,
-					    start_migratetype, fallback_mt);
+					    free_to_migratetype(start_freetype),
+					    free_to_migratetype(fallback_ft));
 		return page;
 	}
 
@@ -2449,7 +2522,7 @@ enum rmqueue_mode {
  * Call me with the zone lock already held.
  */
 static __always_inline struct page *
-__rmqueue(struct zone *zone, unsigned int order, int migratetype,
+__rmqueue(struct zone *zone, unsigned int order, freetype_t freetype,
 	  unsigned int alloc_flags, enum rmqueue_mode *mode)
 {
 	struct page *page;
@@ -2463,7 +2536,8 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
 		if (alloc_flags & ALLOC_CMA &&
 		    zone_page_state(zone, NR_FREE_CMA_PAGES) >
 		    zone_page_state(zone, NR_FREE_PAGES) / 2) {
-			page = __rmqueue_cma_fallback(zone, order);
+			page = __rmqueue_cma_fallback(zone, order,
+						freetype_flags(freetype));
 			if (page)
 				return page;
 		}
@@ -2480,13 +2554,14 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
 	 */
 	switch (*mode) {
 	case RMQUEUE_NORMAL:
-		page = __rmqueue_smallest(zone, order, migratetype);
+		page = __rmqueue_smallest(zone, order, freetype);
 		if (page)
 			return page;
 		fallthrough;
 	case RMQUEUE_CMA:
 		if (alloc_flags & ALLOC_CMA) {
-			page = __rmqueue_cma_fallback(zone, order);
+			page = __rmqueue_cma_fallback(zone, order,
+						freetype_flags(freetype));
 			if (page) {
 				*mode = RMQUEUE_CMA;
 				return page;
@@ -2494,7 +2569,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
 		}
 		fallthrough;
 	case RMQUEUE_CLAIM:
-		page = __rmqueue_claim(zone, order, migratetype, alloc_flags);
+		page = __rmqueue_claim(zone, order, freetype, alloc_flags);
 		if (page) {
 			/* Replenished preferred freelist, back to normal mode. */
 			*mode = RMQUEUE_NORMAL;
@@ -2503,7 +2578,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
 		fallthrough;
 	case RMQUEUE_STEAL:
 		if (!(alloc_flags & ALLOC_NOFRAGMENT)) {
-			page = __rmqueue_steal(zone, order, migratetype);
+			page = __rmqueue_steal(zone, order, freetype);
 			if (page) {
 				*mode = RMQUEUE_STEAL;
 				return page;
@@ -2520,7 +2595,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			unsigned long count, struct list_head *list,
-			int migratetype, unsigned int alloc_flags)
+			freetype_t freetype, unsigned int alloc_flags)
 {
 	enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
 	unsigned long flags;
@@ -2533,7 +2608,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		zone_lock_irqsave(zone, flags);
 	}
 	for (i = 0; i < count; ++i) {
-		struct page *page = __rmqueue(zone, order, migratetype,
+		struct page *page = __rmqueue(zone, order, freetype,
 					      alloc_flags, &rmqm);
 		if (unlikely(page == NULL))
 			break;
@@ -2828,7 +2903,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
  * reacquired. Return true if pcp is locked, false otherwise.
  */
 static bool free_frozen_page_commit(struct zone *zone,
-		struct per_cpu_pages *pcp, struct page *page, int migratetype,
+		struct per_cpu_pages *pcp, struct page *page, freetype_t freetype,
 		unsigned int order, fpi_t fpi_flags)
 {
 	int high, batch;
@@ -2845,7 +2920,7 @@ static bool free_frozen_page_commit(struct zone *zone,
 	 */
 	pcp->alloc_factor >>= 1;
 	__count_vm_events(PGFREE, 1 << order);
-	pindex = order_to_pindex(migratetype, order);
+	pindex = order_to_pindex(free_to_migratetype(freetype), order);
 	list_add(&page->pcp_list, &pcp->lists[pindex]);
 	pcp->count += 1 << order;
 
@@ -2939,6 +3014,7 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
 	struct zone *zone;
 	unsigned long pfn = page_to_pfn(page);
 	int migratetype;
+	freetype_t freetype;
 
 	if (!pcp_allowed_order(order)) {
 		__free_pages_ok(page, order, fpi_flags);
@@ -2956,13 +3032,14 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
 	 * excessively into the page allocator
 	 */
 	zone = page_zone(page);
-	migratetype = get_pfnblock_migratetype(page, pfn);
+	freetype = get_pfnblock_freetype(page, pfn);
+	migratetype = free_to_migratetype(freetype);
 	if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
 			free_one_page(zone, page, pfn, order, fpi_flags);
 			return;
 		}
-		migratetype = MIGRATE_MOVABLE;
+		freetype = freetype_with_migrate(freetype, MIGRATE_MOVABLE);
 	}
 
 	if (unlikely((fpi_flags & FPI_TRYLOCK) && IS_ENABLED(CONFIG_PREEMPT_RT)
@@ -2972,7 +3049,7 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
 	}
 	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 	if (pcp) {
-		if (!free_frozen_page_commit(zone, pcp, page, migratetype,
+		if (!free_frozen_page_commit(zone, pcp, page, freetype,
 						order, fpi_flags))
 			return;
 		pcp_spin_unlock(pcp);
@@ -3029,10 +3106,12 @@ void free_unref_folios(struct folio_batch *folios)
 		struct zone *zone = folio_zone(folio);
 		unsigned long pfn = folio_pfn(folio);
 		unsigned int order = (unsigned long)folio->private;
+		freetype_t freetype;
 		int migratetype;
 
 		folio->private = NULL;
-		migratetype = get_pfnblock_migratetype(&folio->page, pfn);
+		freetype = get_pfnblock_freetype(&folio->page, pfn);
+		migratetype = free_to_migratetype(freetype);
 
 		/* Different zone requires a different pcp lock */
 		if (zone != locked_zone ||
@@ -3071,11 +3150,12 @@ void free_unref_folios(struct folio_batch *folios)
 		 * to the MIGRATE_MOVABLE pcp list.
 		 */
 		if (unlikely(migratetype >= MIGRATE_PCPTYPES))
-			migratetype = MIGRATE_MOVABLE;
+			freetype = freetype_with_migrate(freetype,
+							MIGRATE_MOVABLE);
 
 		trace_mm_page_free_batched(&folio->page);
 		if (!free_frozen_page_commit(zone, pcp, &folio->page,
-				migratetype, order, FPI_NONE)) {
+				freetype, order, FPI_NONE)) {
 			pcp = NULL;
 			locked_zone = NULL;
 		}
@@ -3143,14 +3223,16 @@ int __isolate_free_page(struct page *page, unsigned int order)
 	if (order >= pageblock_order - 1) {
 		struct page *endpage = page + (1 << order) - 1;
 		for (; page < endpage; page += pageblock_nr_pages) {
-			int mt = get_pageblock_migratetype(page);
+			freetype_t old_ft = get_pageblock_freetype(page);
+			freetype_t new_ft = freetype_with_migrate(old_ft,
+				MIGRATE_MOVABLE);
+
 			/*
 			 * Only change normal pageblocks (i.e., they can merge
 			 * with others)
 			 */
-			if (migratetype_is_mergeable(mt))
-				move_freepages_block(zone, page, mt,
-						     MIGRATE_MOVABLE);
+			if (migratetype_is_mergeable(free_to_migratetype(old_ft)))
+				move_freepages_block(zone, page, old_ft, new_ft);
 		}
 	}
 
@@ -3161,12 +3243,13 @@ int __isolate_free_page(struct page *page, unsigned int order)
  * __putback_isolated_page - Return a now-isolated page back where we got it
  * @page: Page that was isolated
  * @order: Order of the isolated page
- * @mt: The page's pageblock's migratetype
+ * @freetype: The page's pageblock's freetype
  *
  * This function is meant to return a page pulled from the free lists via
  * __isolate_free_page back to the free lists they were pulled from.
  */
-void __putback_isolated_page(struct page *page, unsigned int order, int mt)
+void __putback_isolated_page(struct page *page, unsigned int order,
+			     freetype_t freetype)
 {
 	struct zone *zone = page_zone(page);
 
@@ -3174,7 +3257,7 @@ void __putback_isolated_page(struct page *page, unsigned int order, int mt)
 	lockdep_assert_held(&zone->_lock);
 
 	/* Return isolated page to tail of freelist. */
-	__free_one_page(page, page_to_pfn(page), zone, order, mt,
+	__free_one_page(page, page_to_pfn(page), zone, order, freetype,
 			FPI_SKIP_REPORT_NOTIFY | FPI_TO_TAIL);
 }
 
@@ -3207,10 +3290,12 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
 static __always_inline
 struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 			   unsigned int order, unsigned int alloc_flags,
-			   int migratetype)
+			   freetype_t freetype)
 {
 	struct page *page;
 	unsigned long flags;
+	freetype_t ft_high = freetype_with_migrate(freetype,
+						   MIGRATE_HIGHATOMIC);
 
 	do {
 		page = NULL;
@@ -3221,11 +3306,11 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 			zone_lock_irqsave(zone, flags);
 		}
 		if (alloc_flags & ALLOC_HIGHATOMIC)
-			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
+			page = __rmqueue_smallest(zone, order, ft_high);
 		if (!page) {
 			enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
 
-			page = __rmqueue(zone, order, migratetype, alloc_flags, &rmqm);
+			page = __rmqueue(zone, order, freetype, alloc_flags, &rmqm);
 
 			/*
 			 * If the allocation fails, allow OOM handling and
@@ -3234,7 +3319,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 			 * high-order atomic allocation in the future.
 			 */
 			if (!page && (alloc_flags & (ALLOC_OOM|ALLOC_NON_BLOCK)))
-				page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
+				page = __rmqueue_smallest(zone, order, ft_high);
 
 			if (!page) {
 				zone_unlock_irqrestore(zone, flags);
@@ -3303,7 +3388,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
 /* Remove page from the per-cpu list, caller must protect the list */
 static inline
 struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
-			int migratetype,
+			freetype_t freetype,
 			unsigned int alloc_flags,
 			struct per_cpu_pages *pcp,
 			struct list_head *list)
@@ -3317,7 +3402,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 
 			alloced = rmqueue_bulk(zone, order,
 					batch, list,
-					migratetype, alloc_flags);
+					freetype, alloc_flags);
 
 			pcp->count += alloced << order;
 			if (unlikely(list_empty(list)))
@@ -3335,7 +3420,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 /* Lock and remove page from the per-cpu list */
 static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 			struct zone *zone, unsigned int order,
-			int migratetype, unsigned int alloc_flags)
+			freetype_t freetype, unsigned int alloc_flags)
 {
 	struct per_cpu_pages *pcp;
 	struct list_head *list;
@@ -3352,8 +3437,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	 * frees.
 	 */
 	pcp->free_count >>= 1;
-	list = &pcp->lists[order_to_pindex(migratetype, order)];
-	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
+	list = &pcp->lists[order_to_pindex(free_to_migratetype(freetype), order)];
+	page = __rmqueue_pcplist(zone, order, freetype, alloc_flags, pcp, list);
 	pcp_spin_unlock(pcp);
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
@@ -3378,19 +3463,19 @@ static inline
 struct page *rmqueue(struct zone *preferred_zone,
 			struct zone *zone, unsigned int order,
 			gfp_t gfp_flags, unsigned int alloc_flags,
-			int migratetype)
+			freetype_t freetype)
 {
 	struct page *page;
 
 	if (likely(pcp_allowed_order(order))) {
 		page = rmqueue_pcplist(preferred_zone, zone, order,
-				       migratetype, alloc_flags);
+				       freetype, alloc_flags);
 		if (likely(page))
 			goto out;
 	}
 
 	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
-							migratetype);
+							freetype);
 
 out:
 	/* Separate test+clear to avoid unnecessary atomics */
@@ -3412,7 +3497,7 @@ struct page *rmqueue(struct zone *preferred_zone,
 static void reserve_highatomic_pageblock(struct page *page, int order,
 					 struct zone *zone)
 {
-	int mt;
+	freetype_t ft, ft_high;
 	unsigned long max_managed, flags;
 
 	/*
@@ -3434,13 +3519,14 @@ static void reserve_highatomic_pageblock(struct page *page, int order,
 		goto out_unlock;
 
 	/* Yoink! */
-	mt = get_pageblock_migratetype(page);
+	ft = get_pageblock_freetype(page);
 	/* Only reserve normal pageblocks (i.e., they can merge with others) */
-	if (!migratetype_is_mergeable(mt))
+	if (!migratetype_is_mergeable(free_to_migratetype(ft)))
 		goto out_unlock;
 
+	ft_high = freetype_with_migrate(ft, MIGRATE_HIGHATOMIC);
 	if (order < pageblock_order) {
-		if (move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC) == -1)
+		if (move_freepages_block(zone, page, ft, ft_high) == -1)
 			goto out_unlock;
 		zone->nr_reserved_highatomic += pageblock_nr_pages;
 	} else {
@@ -3485,9 +3571,11 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 		zone_lock_irqsave(zone, flags);
 		for (order = 0; order < NR_PAGE_ORDERS; order++) {
 			struct free_area *area = &(zone->free_area[order]);
+			freetype_t ft_high = freetype_with_migrate(ac->freetype,
+							MIGRATE_HIGHATOMIC);
 			unsigned long size;
 
-			page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC);
+			page = get_page_from_free_area(area, ft_high);
 			if (!page)
 				continue;
 
@@ -3514,14 +3602,14 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 			 */
 			if (order < pageblock_order)
 				ret = move_freepages_block(zone, page,
-							   MIGRATE_HIGHATOMIC,
-							   ac->migratetype);
+							   ft_high,
+							   ac->freetype);
 			else {
 				move_to_free_list(page, zone, order,
-						  MIGRATE_HIGHATOMIC,
-						  ac->migratetype);
+						  ft_high,
+						  ac->freetype);
 				change_pageblock_range(page, order,
-						       ac->migratetype);
+					free_to_migratetype(ac->freetype));
 				ret = 1;
 			}
 			/*
@@ -3627,18 +3715,18 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 			continue;
 
 		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
-			if (!free_area_empty(area, mt))
+			if (!free_area_empty(area, migrate_to_freetype(mt, 0)))
 				return true;
 		}
 
 #ifdef CONFIG_CMA
 		if ((alloc_flags & ALLOC_CMA) &&
-		    !free_area_empty(area, MIGRATE_CMA)) {
+		    !free_area_empty(area, migrate_to_freetype(MIGRATE_CMA, 0))) {
 			return true;
 		}
 #endif
 		if ((alloc_flags & (ALLOC_HIGHATOMIC|ALLOC_OOM)) &&
-		    !free_area_empty(area, MIGRATE_HIGHATOMIC)) {
+		    !free_area_empty(area, migrate_to_freetype(MIGRATE_HIGHATOMIC, 0))) {
 			return true;
 		}
 	}
@@ -3762,7 +3850,7 @@ static inline unsigned int gfp_to_alloc_flags_cma(gfp_t gfp_mask,
 						  unsigned int alloc_flags)
 {
 #ifdef CONFIG_CMA
-	if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+	if (free_to_migratetype(gfp_freetype(gfp_mask)) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
 #endif
 	return alloc_flags;
@@ -3925,7 +4013,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 
 try_this_zone:
 		page = rmqueue(zonelist_zone(ac->preferred_zoneref), zone, order,
-				gfp_mask, alloc_flags, ac->migratetype);
+				gfp_mask, alloc_flags, ac->freetype);
 		if (page) {
 			prep_new_page(page, order, gfp_mask, alloc_flags);
 
@@ -4694,6 +4782,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	int reserve_flags;
 	bool compact_first = false;
 	bool can_retry_reserves = true;
+	enum migratetype migratetype = free_to_migratetype(ac->freetype);
 
 	if (unlikely(nofail)) {
 		/*
@@ -4724,8 +4813,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * try prevent permanent fragmentation by migrating from blocks of the
 	 * same migratetype.
 	 */
-	if (can_compact && (costly_order || (order > 0 &&
-					ac->migratetype != MIGRATE_MOVABLE))) {
+	if (can_compact && (costly_order || (order > 0 && migratetype != MIGRATE_MOVABLE))) {
 		compact_first = true;
 		compact_priority = INIT_COMPACT_PRIORITY;
 	}
@@ -4969,7 +5057,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 	ac->highest_zoneidx = gfp_zone(gfp_mask);
 	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
 	ac->nodemask = nodemask;
-	ac->migratetype = gfp_migratetype(gfp_mask);
+	ac->freetype = gfp_freetype(gfp_mask);
 
 	if (cpusets_enabled()) {
 		*alloc_gfp |= __GFP_HARDWALL;
@@ -5133,7 +5221,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 		goto failed;
 
 	/* Attempt the batch allocation */
-	pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
+	pcp_list = &pcp->lists[order_to_pindex(free_to_migratetype(ac.freetype), 0)];
 	while (nr_populated < nr_pages) {
 
 		/* Skip existing pages */
@@ -5142,7 +5230,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 			continue;
 		}
 
-		page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
+		page = __rmqueue_pcplist(zone, 0, ac.freetype, alloc_flags,
 								pcp, pcp_list);
 		if (unlikely(!page)) {
 			/* Try and allocate at least one page */
@@ -5236,7 +5324,8 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
 		page = NULL;
 	}
 
-	trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
+	trace_mm_page_alloc(page, order, alloc_gfp,
+			    free_to_migratetype(ac.freetype));
 	kmsan_alloc_page(page, order, alloc_gfp);
 
 	return page;
@@ -7461,11 +7550,11 @@ EXPORT_SYMBOL(is_free_buddy_page);
 
 #ifdef CONFIG_MEMORY_FAILURE
 static inline void add_to_free_list(struct page *page, struct zone *zone,
-				    unsigned int order, int migratetype,
+				    unsigned int order, freetype_t freetype,
 				    bool tail)
 {
-	__add_to_free_list(page, zone, order, migratetype, tail);
-	account_freepages(zone, 1 << order, migratetype);
+	__add_to_free_list(page, zone, order, freetype, tail);
+	account_freepages(zone, 1 << order, free_to_migratetype(freetype));
 }
 
 /*
@@ -7474,7 +7563,7 @@ static inline void add_to_free_list(struct page *page, struct zone *zone,
  */
 static void break_down_buddy_pages(struct zone *zone, struct page *page,
 				   struct page *target, int low, int high,
-				   int migratetype)
+				   freetype_t freetype)
 {
 	unsigned long size = 1 << high;
 	struct page *current_buddy;
@@ -7493,7 +7582,7 @@ static void break_down_buddy_pages(struct zone *zone, struct page *page,
 		if (set_page_guard(zone, current_buddy, high))
 			continue;
 
-		add_to_free_list(current_buddy, zone, high, migratetype, false);
+		add_to_free_list(current_buddy, zone, high, freetype, false);
 		set_buddy_order(current_buddy, high);
 	}
 }
@@ -7516,13 +7605,13 @@ bool take_page_off_buddy(struct page *page)
 
 		if (PageBuddy(page_head) && page_order >= order) {
 			unsigned long pfn_head = page_to_pfn(page_head);
-			int migratetype = get_pfnblock_migratetype(page_head,
-								   pfn_head);
+			freetype_t freetype = get_pfnblock_freetype(page_head,
+								    pfn_head);
 
 			del_page_from_free_list(page_head, zone, page_order,
-						migratetype);
+						free_to_migratetype(freetype));
 			break_down_buddy_pages(zone, page_head, page, 0,
-						page_order, migratetype);
+						page_order, freetype);
 			SetPageHWPoisonTakenOff(page);
 			ret = true;
 			break;
@@ -7546,10 +7635,10 @@ bool put_page_back_buddy(struct page *page)
 	zone_lock_irqsave(zone, flags);
 	if (put_page_testzero(page)) {
 		unsigned long pfn = page_to_pfn(page);
-		int migratetype = get_pfnblock_migratetype(page, pfn);
+		freetype_t freetype = get_pfnblock_freetype(page, pfn);
 
 		ClearPageHWPoisonTakenOff(page);
-		__free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE);
+		__free_one_page(page, pfn, zone, 0, freetype, FPI_NONE);
 		if (TestClearPageHWPoison(page)) {
 			ret = true;
 		}
@@ -7790,7 +7879,8 @@ struct page *alloc_frozen_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned
 		__free_frozen_pages(page, order, FPI_TRYLOCK);
 		page = NULL;
 	}
-	trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
+	trace_mm_page_alloc(page, order, alloc_gfp,
+			    free_to_migratetype(ac.freetype));
 	kmsan_alloc_page(page, order, alloc_gfp);
 	return page;
 }
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index e8414e9a718a2..8d62e1a424845 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -277,7 +277,7 @@ static void unset_migratetype_isolate(struct page *page)
 		WARN_ON_ONCE(!pageblock_unisolate_and_move_free_pages(zone, page));
 	} else {
 		clear_pageblock_isolate(page);
-		__putback_isolated_page(page, order, get_pageblock_migratetype(page));
+		__putback_isolated_page(page, order, get_pageblock_freetype(page));
 	}
 	zone->nr_isolate_pageblock--;
 out:
diff --git a/mm/page_owner.c b/mm/page_owner.c
index 109f2f28f5b17..557254c20b98f 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -481,7 +481,8 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
 				goto ext_put_continue;
 
 			page_owner = get_page_owner(page_ext);
-			page_mt = gfp_migratetype(page_owner->gfp_mask);
+			page_mt = free_to_migratetype(
+					gfp_freetype(page_owner->gfp_mask));
 			if (pageblock_mt != page_mt) {
 				if (is_migrate_cma(pageblock_mt))
 					count[MIGRATE_MOVABLE]++;
@@ -566,7 +567,7 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 
 	/* Print information relevant to grouping pages by mobility */
 	pageblock_mt = get_pageblock_migratetype(page);
-	page_mt  = gfp_migratetype(page_owner->gfp_mask);
+	page_mt  = free_to_migratetype(gfp_freetype(page_owner->gfp_mask));
 	ret += scnprintf(kbuf + ret, count - ret,
 			"PFN 0x%lx type %s Block %lu type %s Flags %pGp\n",
 			pfn,
@@ -617,7 +618,7 @@ void __dump_page_owner(const struct page *page)
 
 	page_owner = get_page_owner(page_ext);
 	gfp_mask = page_owner->gfp_mask;
-	mt = gfp_migratetype(gfp_mask);
+	mt = free_to_migratetype(gfp_freetype(gfp_mask));
 
 	if (!test_bit(PAGE_EXT_OWNER, &page_ext->flags)) {
 		pr_alert("page_owner info is not present (never set?)\n");
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 8bf2e5bc885e5..aed4b13f6895b 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -114,10 +114,10 @@ page_reporting_drain(struct page_reporting_dev_info *prdev,
 	 */
 	do {
 		struct page *page = sg_page(sg);
-		int mt = get_pageblock_migratetype(page);
+		freetype_t ft = get_pageblock_freetype(page);
 		unsigned int order = get_order(sg->length);
 
-		__putback_isolated_page(page, order, mt);
+		__putback_isolated_page(page, order, ft);
 
 		/* If the pages were not reported due to error skip flagging */
 		if (!reported)
diff --git a/mm/show_mem.c b/mm/show_mem.c
index d7d1b6cd6442d..3ca8891326296 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -374,7 +374,9 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
 
 			types[order] = 0;
 			for (type = 0; type < MIGRATE_TYPES; type++) {
-				if (!free_area_empty(area, type))
+				freetype_t ft = migrate_to_freetype(type, 0);
+
+				if (!free_area_empty(area, ft))
 					types[order] |= 1 << type;
 			}
 		}

-- 
2.51.2



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 11/22] mm: move migratetype definitions to freetype.h
  2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
                   ` (9 preceding siblings ...)
  2026-03-20 18:23 ` [PATCH v2 10/22] mm: introduce freetype_t Brendan Jackman
@ 2026-03-20 18:23 ` Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 12/22] mm: add definitions for allocating unmapped pages Brendan Jackman
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

Since a migratetype is a sub-element of a freetype, move the pure
definitions into the new freetype.h.

This will enable referring to these raw types from pageblock-flags.h.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 include/linux/freetype.h | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mmzone.h   | 73 -----------------------------------------
 2 files changed, 84 insertions(+), 73 deletions(-)

diff --git a/include/linux/freetype.h b/include/linux/freetype.h
index 9f857d10bb5db..11bd6d2b94349 100644
--- a/include/linux/freetype.h
+++ b/include/linux/freetype.h
@@ -3,6 +3,66 @@
 #define _LINUX_FREETYPE_H
 
 #include <linux/types.h>
+#include <linux/mmdebug.h>
+
+/*
+ * A migratetype is the part of a freetype that encodes the mobility
+ * requirements for the allocations the freelist is intended to serve.
+ *
+ * It's also currently overloaded to encode page isolation state.
+ */
+enum migratetype {
+	MIGRATE_UNMOVABLE,
+	MIGRATE_MOVABLE,
+	MIGRATE_RECLAIMABLE,
+	MIGRATE_PCPTYPES,	/* the number of types on the pcp lists */
+	MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
+#ifdef CONFIG_CMA
+	/*
+	 * MIGRATE_CMA migration type is designed to mimic the way
+	 * ZONE_MOVABLE works.  Only movable pages can be allocated
+	 * from MIGRATE_CMA pageblocks and page allocator never
+	 * implicitly change migration type of MIGRATE_CMA pageblock.
+	 *
+	 * The way to use it is to change migratetype of a range of
+	 * pageblocks to MIGRATE_CMA which can be done by
+	 * __free_pageblock_cma() function.
+	 */
+	MIGRATE_CMA,
+	__MIGRATE_TYPE_END = MIGRATE_CMA,
+#else
+	__MIGRATE_TYPE_END = MIGRATE_HIGHATOMIC,
+#endif
+#ifdef CONFIG_MEMORY_ISOLATION
+	MIGRATE_ISOLATE,	/* can't allocate from here */
+#endif
+	MIGRATE_TYPES
+};
+
+/* In mm/page_alloc.c; keep in sync also with show_migration_types() there */
+extern const char * const migratetype_names[MIGRATE_TYPES];
+
+#ifdef CONFIG_CMA
+#  define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
+#else
+#  define is_migrate_cma(migratetype) false
+#endif
+
+static inline bool is_migrate_movable(int mt)
+{
+	return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE;
+}
+
+/*
+ * Check whether a migratetype can be merged with another migratetype.
+ *
+ * It is only mergeable when it can fall back to other migratetypes for
+ * allocation. See fallbacks[MIGRATE_TYPES][3] in page_alloc.c.
+ */
+static inline bool migratetype_is_mergeable(int mt)
+{
+	return mt < MIGRATE_PCPTYPES;
+}
 
 /*
  * A freetype is the index used to identify free lists. This consists of a
@@ -35,4 +95,28 @@ static inline bool freetypes_equal(freetype_t a, freetype_t b)
 	return a.migratetype == b.migratetype;
 }
 
+static inline freetype_t migrate_to_freetype(enum migratetype mt,
+					     unsigned int flags)
+{
+	freetype_t freetype;
+
+	/* No flags supported yet. */
+	VM_WARN_ON_ONCE(flags);
+
+	freetype.migratetype = mt;
+	return freetype;
+}
+
+static inline enum migratetype free_to_migratetype(freetype_t freetype)
+{
+	return freetype.migratetype;
+}
+
+/* Convenience helper, return the freetype modified to have the migratetype. */
+static inline freetype_t freetype_with_migrate(freetype_t freetype,
+					       enum migratetype migratetype)
+{
+	return migrate_to_freetype(migratetype, freetype_flags(freetype));
+}
+
 #endif /* _LINUX_FREETYPE_H */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c456ddd1f5979..af662e4912591 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -116,39 +116,7 @@
 #define __NR_VMEMMAP_TAILS (MAX_FOLIO_ORDER - VMEMMAP_TAIL_MIN_ORDER + 1)
 #define NR_VMEMMAP_TAILS (__NR_VMEMMAP_TAILS > 0 ? __NR_VMEMMAP_TAILS : 0)
 
-enum migratetype {
-	MIGRATE_UNMOVABLE,
-	MIGRATE_MOVABLE,
-	MIGRATE_RECLAIMABLE,
-	MIGRATE_PCPTYPES,	/* the number of types on the pcp lists */
-	MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
 #ifdef CONFIG_CMA
-	/*
-	 * MIGRATE_CMA migration type is designed to mimic the way
-	 * ZONE_MOVABLE works.  Only movable pages can be allocated
-	 * from MIGRATE_CMA pageblocks and page allocator never
-	 * implicitly change migration type of MIGRATE_CMA pageblock.
-	 *
-	 * The way to use it is to change migratetype of a range of
-	 * pageblocks to MIGRATE_CMA which can be done by
-	 * __free_pageblock_cma() function.
-	 */
-	MIGRATE_CMA,
-	__MIGRATE_TYPE_END = MIGRATE_CMA,
-#else
-	__MIGRATE_TYPE_END = MIGRATE_HIGHATOMIC,
-#endif
-#ifdef CONFIG_MEMORY_ISOLATION
-	MIGRATE_ISOLATE,	/* can't allocate from here */
-#endif
-	MIGRATE_TYPES
-};
-
-/* In mm/page_alloc.c; keep in sync also with show_migration_types() there */
-extern const char * const migratetype_names[MIGRATE_TYPES];
-
-#ifdef CONFIG_CMA
-#  define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
 #  define is_migrate_cma_page(_page) (get_pageblock_migratetype(_page) == MIGRATE_CMA)
 /*
  * __dump_folio() in mm/debug.c passes a folio pointer to on-stack struct folio,
@@ -157,27 +125,10 @@ extern const char * const migratetype_names[MIGRATE_TYPES];
 #  define is_migrate_cma_folio(folio, pfn) \
 	(get_pfnblock_migratetype(&folio->page, pfn) == MIGRATE_CMA)
 #else
-#  define is_migrate_cma(migratetype) false
 #  define is_migrate_cma_page(_page) false
 #  define is_migrate_cma_folio(folio, pfn) false
 #endif
 
-static inline bool is_migrate_movable(int mt)
-{
-	return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE;
-}
-
-/*
- * Check whether a migratetype can be merged with another migratetype.
- *
- * It is only mergeable when it can fall back to other migratetypes for
- * allocation. See fallbacks[MIGRATE_TYPES][3] in page_alloc.c.
- */
-static inline bool migratetype_is_mergeable(int mt)
-{
-	return mt < MIGRATE_PCPTYPES;
-}
-
 #define for_each_free_list(list, zone, order) \
 	for (order = 0; order < NR_PAGE_ORDERS; order++) \
 		for (unsigned int idx = 0; \
@@ -185,30 +136,6 @@ static inline bool migratetype_is_mergeable(int mt)
 		     idx < NR_FREETYPE_IDXS; \
 		     idx++)
 
-static inline freetype_t migrate_to_freetype(enum migratetype mt,
-					     unsigned int flags)
-{
-	freetype_t freetype;
-
-	/* No flags supported yet. */
-	VM_WARN_ON_ONCE(flags);
-
-	freetype.migratetype = mt;
-	return freetype;
-}
-
-static inline enum migratetype free_to_migratetype(freetype_t freetype)
-{
-	return freetype.migratetype;
-}
-
-/* Convenience helper, return the freetype modified to have the migratetype. */
-static inline freetype_t freetype_with_migrate(freetype_t freetype,
-					       enum migratetype migratetype)
-{
-	return migrate_to_freetype(migratetype, freetype_flags(freetype));
-}
-
 extern int page_group_by_mobility_disabled;
 
 freetype_t get_pfnblock_freetype(const struct page *page, unsigned long pfn);

-- 
2.51.2



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 12/22] mm: add definitions for allocating unmapped pages
  2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
                   ` (10 preceding siblings ...)
  2026-03-20 18:23 ` [PATCH v2 11/22] mm: move migratetype definitions to freetype.h Brendan Jackman
@ 2026-03-20 18:23 ` Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 13/22] mm: rejig pageblock mask definitions Brendan Jackman
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

Create __GFP_UNMAPPED, which requests pages that are not present in the
direct map. Since this feature has a cost (e.g. more freelists), it's
behind a kconfig. Unlike other conditionally-defined GFP flags, it
doesn't fall back to being 0. This prevents building code that uses
__GFP_UNMAPPED but doesn't depend on the necessary kconfig, since that
would lead to invisible security issues.

Create a freetype flag to record that pages on freelists carrying this
flag are unmapped. This is currently only needed for MIGRATE_UNMOVABLE
pages, so the freetype encoding remains trivial.

Also create the corresponding pageblock flag to record the same thing.

To keep patches from being too overwhelming, the actual implementation
is added separately; this patch is just types, Kconfig boilerplate, etc.
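
As a rough illustration of the encoding this patch introduces, here is a
standalone sketch (the kernel's BIT()/order_base_2() machinery is
replaced by plain constants, and the enum is truncated, so this is not
the actual kernel code):

```c
#include <assert.h>

/* Simplified stand-ins for the kernel's types; values are illustrative. */
enum migratetype { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE };

#define FREETYPE_UNMAPPED 0x1u	/* would be BIT(FREETYPE_UNMAPPED_BIT) */

/* A freetype is a migratetype plus orthogonal flag bits. */
typedef struct {
	unsigned int migratetype : 3;
	unsigned int flags : 1;
} freetype_t;

static freetype_t migrate_to_freetype(enum migratetype mt, unsigned int flags)
{
	freetype_t ft = { .migratetype = mt, .flags = flags };
	return ft;
}

static enum migratetype free_to_migratetype(freetype_t ft)
{
	return ft.migratetype;
}
```

The point is that the migratetype and the unmapped state travel together
as one freelist index, but can still be recovered independently.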

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 include/linux/freetype.h       | 43 +++++++++++++++++++++++++++++++++---------
 include/linux/gfp_types.h      | 26 +++++++++++++++++++++++++
 include/trace/events/mmflags.h |  9 ++++++++-
 mm/Kconfig                     |  4 ++++
 4 files changed, 72 insertions(+), 10 deletions(-)

diff --git a/include/linux/freetype.h b/include/linux/freetype.h
index 11bd6d2b94349..a8d0c344a5792 100644
--- a/include/linux/freetype.h
+++ b/include/linux/freetype.h
@@ -2,6 +2,7 @@
 #ifndef _LINUX_FREETYPE_H
 #define _LINUX_FREETYPE_H
 
+#include <linux/log2.h>
 #include <linux/types.h>
 #include <linux/mmdebug.h>
 
@@ -64,35 +65,61 @@ static inline bool migratetype_is_mergeable(int mt)
 	return mt < MIGRATE_PCPTYPES;
 }
 
+enum {
+	/* Defined unconditionally as a hack to avoid a zero-width bitfield. */
+	FREETYPE_UNMAPPED_BIT,
+	NUM_FREETYPE_FLAGS,
+};
+
 /*
  * A freetype is the index used to identify free lists. This consists of a
  * migratetype, and other bits which encode orthogonal properties of memory.
  */
 typedef struct {
-	int migratetype;
+	unsigned int migratetype : order_base_2(MIGRATE_TYPES);
+	unsigned int flags : NUM_FREETYPE_FLAGS;
 } freetype_t;
 
+#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
+#define FREETYPE_UNMAPPED			BIT(FREETYPE_UNMAPPED_BIT)
+#define NUM_UNMAPPED_FREETYPES			1
+#else
+#define FREETYPE_UNMAPPED			0
+#define NUM_UNMAPPED_FREETYPES			0
+#endif
+
 /*
  * Return a dense linear index for freetypes that have lists in the free area.
  * Return -1 for other freetypes.
  */
 static inline int freetype_idx(freetype_t freetype)
 {
+	/* For FREETYPE_UNMAPPED, only MIGRATE_UNMOVABLE has an index. */
+	if (freetype.flags & FREETYPE_UNMAPPED) {
+		VM_WARN_ON_ONCE(freetype.flags & ~FREETYPE_UNMAPPED);
+		if (!IS_ENABLED(CONFIG_PAGE_ALLOC_UNMAPPED))
+			return -1;
+		if (freetype.migratetype != MIGRATE_UNMOVABLE)
+			return -1;
+		return MIGRATE_TYPES;
+	}
+	/* No other flags are supported. */
+	VM_WARN_ON_ONCE(freetype.flags);
+
 	return freetype.migratetype;
 }
 
-/* No freetype flags actually exist yet. */
-#define NR_FREETYPE_IDXS MIGRATE_TYPES
+/* One for each migratetype, plus one for MIGRATE_UNMOVABLE-FREETYPE_UNMAPPED */
+#define NR_FREETYPE_IDXS (MIGRATE_TYPES + NUM_UNMAPPED_FREETYPES)
 
 static inline unsigned int freetype_flags(freetype_t freetype)
 {
-	/* No flags supported yet. */
-	return 0;
+	return freetype.flags;
 }
 
 static inline bool freetypes_equal(freetype_t a, freetype_t b)
 {
-	return a.migratetype == b.migratetype;
+	return a.migratetype == b.migratetype && a.flags == b.flags;
 }
 
 static inline freetype_t migrate_to_freetype(enum migratetype mt,
@@ -100,10 +127,8 @@ static inline freetype_t migrate_to_freetype(enum migratetype mt,
 {
 	freetype_t freetype;
 
-	/* No flags supported yet. */
-	VM_WARN_ON_ONCE(flags);
-
 	freetype.migratetype = mt;
+	freetype.flags = flags;
 	return freetype;
 }
 
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 6c75df30a281d..61d32697ea335 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -56,6 +56,9 @@ enum {
 	___GFP_NOLOCKDEP_BIT,
 #endif
 	___GFP_NO_OBJ_EXT_BIT,
+#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
+	___GFP_UNMAPPED_BIT,
+#endif
 	___GFP_LAST_BIT
 };
 
@@ -97,6 +100,10 @@ enum {
 #define ___GFP_NOLOCKDEP	0
 #endif
 #define ___GFP_NO_OBJ_EXT       BIT(___GFP_NO_OBJ_EXT_BIT)
+#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
+#define ___GFP_UNMAPPED		BIT(___GFP_UNMAPPED_BIT)
+/* No #else - __GFP_UNMAPPED should never be a nop. Break the build if it isn't supported. */
+#endif
 
 /*
  * Physical address zone modifiers (see linux/mmzone.h - low four bits)
@@ -295,6 +302,25 @@ enum {
 /* Disable lockdep for GFP context tracking */
 #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
 
+/*
+ * Allocate pages that aren't present in the direct map. If the caller changes
+ * direct map presence, it must be restored to the previous state before freeing
+ * the page. (This is true regardless of __GFP_UNMAPPED).
+ *
+ * This uses the mermap (when __GFP_ZERO), so it's only valid to allocate with
+ * this flag where that's valid, namely from process context after the mermap
+ * has been initialised for that process. This also means that the allocator
+ * leaves behind stale TLB entries in the mermap region. The caller is
+ * responsible for ensuring they are flushed as needed.
+ *
+ * This is currently incompatible with __GFP_MOVABLE and __GFP_RECLAIMABLE, but
+ * only because of allocator implementation details; if a use case arises this
+ * restriction could be dropped.
+ */
+#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
+#define __GFP_UNMAPPED ((__force gfp_t)___GFP_UNMAPPED)
+#endif
+
 /* Room for N __GFP_FOO bits */
 #define __GFP_BITS_SHIFT ___GFP_LAST_BIT
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a6e5a44c9b429..bb365da355b3a 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -61,11 +61,18 @@
 # define TRACE_GFP_FLAGS_SLAB
 #endif
 
+#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
+# define TRACE_GFP_FLAGS_UNMAPPED TRACE_GFP_EM(UNMAPPED)
+#else
+# define TRACE_GFP_FLAGS_UNMAPPED
+#endif
+
 #define TRACE_GFP_FLAGS				\
 	TRACE_GFP_FLAGS_GENERAL			\
 	TRACE_GFP_FLAGS_KASAN			\
 	TRACE_GFP_FLAGS_LOCKDEP			\
-	TRACE_GFP_FLAGS_SLAB
+	TRACE_GFP_FLAGS_SLAB			\
+	TRACE_GFP_FLAGS_UNMAPPED
 
 #undef TRACE_GFP_EM
 #define TRACE_GFP_EM(a) TRACE_DEFINE_ENUM(___GFP_##a##_BIT);
diff --git a/mm/Kconfig b/mm/Kconfig
index e98db58d515fc..b915af74d33cc 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1506,3 +1506,7 @@ config MERMAP_KUNIT_TEST
 	  If unsure, say N.
 
 endmenu
+
+config PAGE_ALLOC_UNMAPPED
+	bool "Support allocating pages that aren't in the direct map" if COMPILE_TEST
+	default COMPILE_TEST

-- 
2.51.2



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 13/22] mm: rejig pageblock mask definitions
  2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
                   ` (11 preceding siblings ...)
  2026-03-20 18:23 ` [PATCH v2 12/22] mm: add definitions for allocating unmapped pages Brendan Jackman
@ 2026-03-20 18:23 ` Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 14/22] mm: encode freetype flags in pageblock flags Brendan Jackman
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

A later patch will complicate the definition of these masks, this is a
preparatory patch to make that patch easier to review.

- More masks will be needed, so add a PAGEBLOCK_ prefix to the names
  to avoid polluting the "global namespace" too much.

- This makes MIGRATETYPE_AND_ISO_MASK start to look pretty long. Well,
  that global mask only exists for quite a specific purpose so just drop
  it and take advantage of the newly-defined PAGEBLOCK_ISO_MASK.
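
For reference, a minimal sketch of how the renamed masks compose (bit
positions are illustrative and do not exactly match enum pageblock_bits;
CONFIG_MEMORY_ISOLATION is assumed enabled):

```c
#include <assert.h>

/* Illustrative bit positions, loosely mirroring enum pageblock_bits. */
#define PB_migrate_0		0
#define PB_migrate_1		1
#define PB_migrate_2		2
#define PB_migrate_isolate	3

#define BIT(n) (1UL << (n))

#define PAGEBLOCK_MIGRATETYPE_MASK \
	(BIT(PB_migrate_0) | BIT(PB_migrate_1) | BIT(PB_migrate_2))
#define PAGEBLOCK_ISO_MASK	BIT(PB_migrate_isolate)

/* Callers that used MIGRATETYPE_AND_ISO_MASK now just OR the parts: */
static unsigned long migratetype_and_iso_mask(void)
{
	return PAGEBLOCK_MIGRATETYPE_MASK | PAGEBLOCK_ISO_MASK;
}
```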

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 include/linux/pageblock-flags.h |  6 +++---
 mm/page_alloc.c                 | 12 ++++++------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index e046278a01fa8..9a6c3ea17684d 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -36,12 +36,12 @@ enum pageblock_bits {
 
 #define NR_PAGEBLOCK_BITS (roundup_pow_of_two(__NR_PAGEBLOCK_BITS))
 
-#define MIGRATETYPE_MASK (BIT(PB_migrate_0)|BIT(PB_migrate_1)|BIT(PB_migrate_2))
+#define PAGEBLOCK_MIGRATETYPE_MASK (BIT(PB_migrate_0)|BIT(PB_migrate_1)|BIT(PB_migrate_2))
 
 #ifdef CONFIG_MEMORY_ISOLATION
-#define MIGRATETYPE_AND_ISO_MASK (MIGRATETYPE_MASK | BIT(PB_migrate_isolate))
+#define PAGEBLOCK_ISO_MASK	BIT(PB_migrate_isolate)
 #else
-#define MIGRATETYPE_AND_ISO_MASK MIGRATETYPE_MASK
+#define PAGEBLOCK_ISO_MASK	0
 #endif
 
 #if defined(CONFIG_HUGETLB_PAGE)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 018622aa19006..d16f7b7bdf282 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -362,7 +362,7 @@ get_pfnblock_bitmap_bitidx(const struct page *page, unsigned long pfn,
 #else
 	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
 #endif
-	BUILD_BUG_ON(__MIGRATE_TYPE_END > MIGRATETYPE_MASK);
+	BUILD_BUG_ON(__MIGRATE_TYPE_END > PAGEBLOCK_MIGRATETYPE_MASK);
 	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
 
 	bitmap = get_pageblock_bitmap(page, pfn);
@@ -466,7 +466,7 @@ get_pfnblock_freetype(const struct page *page, unsigned long pfn)
 __always_inline enum migratetype
 get_pfnblock_migratetype(const struct page *page, unsigned long pfn)
 {
-	unsigned long mask = MIGRATETYPE_AND_ISO_MASK;
+	unsigned long mask = PAGEBLOCK_MIGRATETYPE_MASK | PAGEBLOCK_ISO_MASK;
 	unsigned long flags;
 
 	flags = __get_pfnblock_flags_mask(page, pfn, mask);
@@ -475,7 +475,7 @@ get_pfnblock_migratetype(const struct page *page, unsigned long pfn)
 	if (flags & BIT(PB_migrate_isolate))
 		return MIGRATE_ISOLATE;
 #endif
-	return flags & MIGRATETYPE_MASK;
+	return flags & PAGEBLOCK_MIGRATETYPE_MASK;
 }
 
 /**
@@ -563,11 +563,11 @@ static void set_pageblock_migratetype(struct page *page,
 	}
 	VM_WARN_ONCE(get_pageblock_isolate(page),
 		     "Use clear_pageblock_isolate() to unisolate pageblock");
-	/* MIGRATETYPE_AND_ISO_MASK clears PB_migrate_isolate if it is set */
+	/* PAGEBLOCK_ISO_MASK clears PB_migrate_isolate if it is set */
 #endif
 	__set_pfnblock_flags_mask(page, page_to_pfn(page),
 				  (unsigned long)migratetype,
-				  MIGRATETYPE_AND_ISO_MASK);
+				  PAGEBLOCK_MIGRATETYPE_MASK | PAGEBLOCK_ISO_MASK);
 }
 
 void __meminit init_pageblock_migratetype(struct page *page,
@@ -593,7 +593,7 @@ void __meminit init_pageblock_migratetype(struct page *page,
 		flags |= BIT(PB_migrate_isolate);
 #endif
 	__set_pfnblock_flags_mask(page, page_to_pfn(page), flags,
-				  MIGRATETYPE_AND_ISO_MASK);
+				  PAGEBLOCK_MIGRATETYPE_MASK | PAGEBLOCK_ISO_MASK);
 }
 
 #ifdef CONFIG_DEBUG_VM

-- 
2.51.2



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 14/22] mm: encode freetype flags in pageblock flags
  2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
                   ` (12 preceding siblings ...)
  2026-03-20 18:23 ` [PATCH v2 13/22] mm: rejig pageblock mask definitions Brendan Jackman
@ 2026-03-20 18:23 ` Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 15/22] mm/page_alloc: remove ifdefs from pindex helpers Brendan Jackman
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

In preparation for implementing allocation from FREETYPE_UNMAPPED lists.

Since it works nicely with the existing allocator logic, and also offers
a simple way to amortize TLB flushing costs, __GFP_UNMAPPED will be
implemented by changing mappings at pageblock granularity. Therefore,
encode the mapping state in the pageblock flags.

Also add the necessary logic to record this state from a freetype, and
to reconstruct a freetype from the pageblock flags.
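
The decode direction can be sketched standalone as follows (the bit
layout is simplified and the helper names are hypothetical, not the
kernel's __get_pfnblock_freetype() itself):

```c
#include <assert.h>

/* Simplified pageblock bit layout (illustrative positions). */
#define PB_migrate_mask		0x7u	/* three migratetype bits */
#define PB_freetype_flags	3	/* shift of the freetype flag bits */
#define PB_freetype_flags_mask	(0x1u << PB_freetype_flags)

enum migratetype { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE };

struct freetype { unsigned int migratetype, flags; };

/* Reconstruct a freetype from a raw pageblock flags word. */
static struct freetype decode_pageblock(unsigned int word)
{
	struct freetype ft = {
		.migratetype = word & PB_migrate_mask,
		.flags = (word & PB_freetype_flags_mask) >> PB_freetype_flags,
	};
	return ft;
}
```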

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 include/linux/pageblock-flags.h | 10 ++++++++++
 mm/page_alloc.c                 | 33 +++++++++++++++++++++++++--------
 2 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 9a6c3ea17684d..b634280050071 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -11,6 +11,8 @@
 #ifndef PAGEBLOCK_FLAGS_H
 #define PAGEBLOCK_FLAGS_H
 
+#include <linux/freetype.h>
+#include <linux/log2.h>
 #include <linux/types.h>
 
 /* Bit indices that affect a whole block of pages */
@@ -18,6 +20,9 @@ enum pageblock_bits {
 	PB_migrate_0,
 	PB_migrate_1,
 	PB_migrate_2,
+	PB_freetype_flags,
+	PB_freetype_flags_end = PB_freetype_flags + NUM_FREETYPE_FLAGS - 1,
+
 	PB_compact_skip,/* If set the block is skipped by compaction */
 
 #ifdef CONFIG_MEMORY_ISOLATION
@@ -37,6 +42,7 @@ enum pageblock_bits {
 #define NR_PAGEBLOCK_BITS (roundup_pow_of_two(__NR_PAGEBLOCK_BITS))
 
 #define PAGEBLOCK_MIGRATETYPE_MASK (BIT(PB_migrate_0)|BIT(PB_migrate_1)|BIT(PB_migrate_2))
+#define PAGEBLOCK_FREETYPE_FLAGS_MASK (((1 << NUM_FREETYPE_FLAGS) - 1) << PB_freetype_flags)
 
 #ifdef CONFIG_MEMORY_ISOLATION
 #define PAGEBLOCK_ISO_MASK	BIT(PB_migrate_isolate)
@@ -44,6 +50,10 @@ enum pageblock_bits {
 #define PAGEBLOCK_ISO_MASK	0
 #endif
 
+#define PAGEBLOCK_FREETYPE_MASK (PAGEBLOCK_MIGRATETYPE_MASK | \
+				 PAGEBLOCK_ISO_MASK | \
+				 PAGEBLOCK_FREETYPE_FLAGS_MASK)
+
 #if defined(CONFIG_HUGETLB_PAGE)
 
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d16f7b7bdf282..994ddcb132aed 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -357,11 +357,8 @@ get_pfnblock_bitmap_bitidx(const struct page *page, unsigned long pfn,
 	unsigned long *bitmap;
 	unsigned long word_bitidx;
 
-#ifdef CONFIG_MEMORY_ISOLATION
-	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 8);
-#else
-	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
-#endif
+	/* NR_PAGEBLOCK_BITS must divide word size. */
+	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4 && NR_PAGEBLOCK_BITS != 8);
 	BUILD_BUG_ON(__MIGRATE_TYPE_END > PAGEBLOCK_MIGRATETYPE_MASK);
 	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
 
@@ -434,9 +431,20 @@ __always_inline freetype_t
 __get_pfnblock_freetype(const struct page *page, unsigned long pfn,
 			bool ignore_iso)
 {
-	int mt = get_pfnblock_migratetype(page, pfn);
+	unsigned long mask = PAGEBLOCK_FREETYPE_MASK;
+	enum migratetype migratetype;
+	unsigned int ft_flags;
+	unsigned long flags;
 
-	return migrate_to_freetype(mt, 0);
+	flags = __get_pfnblock_flags_mask(page, pfn, mask);
+	ft_flags = (flags & PAGEBLOCK_FREETYPE_FLAGS_MASK) >> PB_freetype_flags;
+
+	migratetype = flags & PAGEBLOCK_MIGRATETYPE_MASK;
+#ifdef CONFIG_MEMORY_ISOLATION
+	if (!ignore_iso && flags & BIT(PB_migrate_isolate))
+		migratetype = MIGRATE_ISOLATE;
+#endif
+	return migrate_to_freetype(migratetype, ft_flags);
 }
 
 /**
@@ -570,6 +578,15 @@ static void set_pageblock_migratetype(struct page *page,
 				  PAGEBLOCK_MIGRATETYPE_MASK | PAGEBLOCK_ISO_MASK);
 }
 
+static inline void set_pageblock_freetype_flags(struct page *page,
+						unsigned int ft_flags)
+{
+	unsigned int flags = ft_flags << PB_freetype_flags;
+
+	__set_pfnblock_flags_mask(page, page_to_pfn(page), flags,
+				  PAGEBLOCK_FREETYPE_FLAGS_MASK);
+}
+
 void __meminit init_pageblock_migratetype(struct page *page,
 					  enum migratetype migratetype,
 					  bool isolate)
@@ -593,7 +610,7 @@ void __meminit init_pageblock_migratetype(struct page *page,
 		flags |= BIT(PB_migrate_isolate);
 #endif
 	__set_pfnblock_flags_mask(page, page_to_pfn(page), flags,
-				  PAGEBLOCK_MIGRATETYPE_MASK | PAGEBLOCK_ISO_MASK);
+				  PAGEBLOCK_FREETYPE_MASK);
 }
 
 #ifdef CONFIG_DEBUG_VM

-- 
2.51.2



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 15/22] mm/page_alloc: remove ifdefs from pindex helpers
  2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
                   ` (13 preceding siblings ...)
  2026-03-20 18:23 ` [PATCH v2 14/22] mm: encode freetype flags in pageblock flags Brendan Jackman
@ 2026-03-20 18:23 ` Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 16/22] mm/page_alloc: separate pcplists by freetype flags Brendan Jackman
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

The ifdefs are not technically needed here; everything they guard is
always defined.

They aren't doing much harm right now but a following patch will
complicate these functions. Switching to IS_ENABLED() makes the code a
bit less tiresome to read.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 mm/page_alloc.c | 30 ++++++++++++++----------------
 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 994ddcb132aed..f125eae790f73 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -696,19 +696,17 @@ static void bad_page(struct page *page, const char *reason)
 
 static inline unsigned int order_to_pindex(int migratetype, int order)
 {
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+		bool movable = migratetype == MIGRATE_MOVABLE;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	bool movable;
-	if (order > PAGE_ALLOC_COSTLY_ORDER) {
-		VM_BUG_ON(!is_pmd_order(order));
+		if (order > PAGE_ALLOC_COSTLY_ORDER) {
+			VM_BUG_ON(!is_pmd_order(order));
 
-		movable = migratetype == MIGRATE_MOVABLE;
-
-		return NR_LOWORDER_PCP_LISTS + movable;
+			return NR_LOWORDER_PCP_LISTS + movable;
+		}
+	} else {
+		VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
 	}
-#else
-	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
-#endif
 
 	return (MIGRATE_PCPTYPES * order) + migratetype;
 }
@@ -717,12 +715,12 @@ static inline int pindex_to_order(unsigned int pindex)
 {
 	int order = pindex / MIGRATE_PCPTYPES;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (pindex >= NR_LOWORDER_PCP_LISTS)
-		order = HPAGE_PMD_ORDER;
-#else
-	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
-#endif
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+		if (pindex >= NR_LOWORDER_PCP_LISTS)
+			order = HPAGE_PMD_ORDER;
+	} else {
+		VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
+	}
 
 	return order;
 }

-- 
2.51.2



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 16/22] mm/page_alloc: separate pcplists by freetype flags
  2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
                   ` (14 preceding siblings ...)
  2026-03-20 18:23 ` [PATCH v2 15/22] mm/page_alloc: remove ifdefs from pindex helpers Brendan Jackman
@ 2026-03-20 18:23 ` Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 17/22] mm/page_alloc: rename ALLOC_NON_BLOCK back to _HARDER Brendan Jackman
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

The normal freelists are already separated by this flag, so now update
the pcplists accordingly. This follows the most "obvious" design where
__GFP_UNMAPPED is supported at arbitrary orders.

If necessary, the proliferation of pcplists could be avoided by
restricting the orders at which FREETYPE_UNMAPPED pages can be allocated
from them.

On the other hand, there's currently no usecase for movable/reclaimable
unmapped memory, and constraining the migratetype doesn't have any
tricky plumbing implications. So, take advantage of that and assume that
FREETYPE_UNMAPPED implies MIGRATE_UNMOVABLE.

Overall, this just takes the existing space of pindices and tacks
another bank on the end. For !THP this is just 4 more lists; with THP
there is a single additional list for hugepages.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 include/linux/mmzone.h | 11 ++++++++++-
 mm/page_alloc.c        | 44 +++++++++++++++++++++++++++++++++-----------
 2 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index af662e4912591..65efc08152b0c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -778,8 +778,17 @@ enum zone_watermarks {
 #else
 #define NR_PCP_THP 0
 #endif
+/*
+ * FREETYPE_UNMAPPED can currently only be used with MIGRATE_UNMOVABLE, so
+ * there's no need to encode the migratetype in the pindex.
+ */
+#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
+#define NR_UNMAPPED_PCP_LISTS (PAGE_ALLOC_COSTLY_ORDER + 1 + !!NR_PCP_THP)
+#else
+#define NR_UNMAPPED_PCP_LISTS 0
+#endif
 #define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
-#define NR_PCP_LISTS (NR_LOWORDER_PCP_LISTS + NR_PCP_THP)
+#define NR_PCP_LISTS (NR_LOWORDER_PCP_LISTS + NR_PCP_THP + NR_UNMAPPED_PCP_LISTS)
 
 /*
  * Flags used in pcp->flags field.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f125eae790f73..53848312a0c21 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -694,18 +694,30 @@ static void bad_page(struct page *page, const char *reason)
 	add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
 }
 
-static inline unsigned int order_to_pindex(int migratetype, int order)
+static inline unsigned int order_to_pindex(freetype_t freetype, int order)
 {
+	int migratetype = free_to_migratetype(freetype);
+
+	VM_BUG_ON(migratetype >= MIGRATE_PCPTYPES);
+	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER &&
+		(!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) || order != HPAGE_PMD_ORDER));
+
+	/* FREETYPE_UNMAPPED currently always means MIGRATE_UNMOVABLE. */
+	if (freetype_flags(freetype) & FREETYPE_UNMAPPED) {
+		int order_offset = order;
+
+		VM_BUG_ON(migratetype != MIGRATE_UNMOVABLE);
+		if (order > PAGE_ALLOC_COSTLY_ORDER)
+			order_offset = PAGE_ALLOC_COSTLY_ORDER + 1;
+
+		return NR_LOWORDER_PCP_LISTS + NR_PCP_THP + order_offset;
+	}
+
 	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
 		bool movable = migratetype == MIGRATE_MOVABLE;
 
-		if (order > PAGE_ALLOC_COSTLY_ORDER) {
-			VM_BUG_ON(!is_pmd_order(order));
-
+		if (order > PAGE_ALLOC_COSTLY_ORDER)
 			return NR_LOWORDER_PCP_LISTS + movable;
-		}
-	} else {
-		VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
 	}
 
 	return (MIGRATE_PCPTYPES * order) + migratetype;
@@ -713,8 +725,18 @@ static inline unsigned int order_to_pindex(int migratetype, int order)
 
 static inline int pindex_to_order(unsigned int pindex)
 {
-	int order = pindex / MIGRATE_PCPTYPES;
+	unsigned int unmapped_base = NR_LOWORDER_PCP_LISTS + NR_PCP_THP;
+	int order;
 
+	if (pindex >= unmapped_base) {
+		order = pindex - unmapped_base;
+		if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+		    order > PAGE_ALLOC_COSTLY_ORDER)
+			return HPAGE_PMD_ORDER;
+		return order;
+	}
+
+	order = pindex / MIGRATE_PCPTYPES;
 	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
 		if (pindex >= NR_LOWORDER_PCP_LISTS)
 			order = HPAGE_PMD_ORDER;
@@ -2935,7 +2957,7 @@ static bool free_frozen_page_commit(struct zone *zone,
 	 */
 	pcp->alloc_factor >>= 1;
 	__count_vm_events(PGFREE, 1 << order);
-	pindex = order_to_pindex(free_to_migratetype(freetype), order);
+	pindex = order_to_pindex(freetype, order);
 	list_add(&page->pcp_list, &pcp->lists[pindex]);
 	pcp->count += 1 << order;
 
@@ -3452,7 +3474,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	 * frees.
 	 */
 	pcp->free_count >>= 1;
-	list = &pcp->lists[order_to_pindex(free_to_migratetype(freetype), order)];
+	list = &pcp->lists[order_to_pindex(freetype, order)];
 	page = __rmqueue_pcplist(zone, order, freetype, alloc_flags, pcp, list);
 	pcp_spin_unlock(pcp);
 	if (page) {
@@ -5236,7 +5258,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 		goto failed;
 
 	/* Attempt the batch allocation */
-	pcp_list = &pcp->lists[order_to_pindex(free_to_migratetype(ac.freetype), 0)];
+	pcp_list = &pcp->lists[order_to_pindex(ac.freetype, 0)];
 	while (nr_populated < nr_pages) {
 
 		/* Skip existing pages */

-- 
2.51.2



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 17/22] mm/page_alloc: rename ALLOC_NON_BLOCK back to _HARDER
  2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
                   ` (15 preceding siblings ...)
  2026-03-20 18:23 ` [PATCH v2 16/22] mm/page_alloc: separate pcplists by freetype flags Brendan Jackman
@ 2026-03-20 18:23 ` Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 18/22] mm/page_alloc: introduce ALLOC_NOBLOCK Brendan Jackman
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

Commit 1ebbb21811b7 ("mm/page_alloc: explicitly define how __GFP_HIGH
non-blocking allocations accesses reserves") renamed ALLOC_HARDER to
ALLOC_NON_BLOCK because the former is "a vague description".

However, vagueness is accurate here: this is a vague flag. It is not set
for __GFP_NOMEMALLOC. It doesn't really mean "allocate without blocking"
but rather "allow dipping into atomic reserves, _because_ of the need
not to block".

A later commit will need an alloc flag that really means "don't block
here", so go back to the flag's old name and update the commentary
to try and give it a slightly clearer meaning.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 mm/internal.h   | 9 +++++----
 mm/page_alloc.c | 8 ++++----
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index d929274d73b92..cc19a90a7933f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1413,9 +1413,10 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_OOM		ALLOC_NO_WATERMARKS
 #endif
 
-#define ALLOC_NON_BLOCK		 0x10 /* Caller cannot block. Allow access
-				       * to 25% of the min watermark or
-				       * 62.5% if __GFP_HIGH is set.
+#define ALLOC_HARDER		 0x10 /* Because the caller cannot block,
+				       * allow access to 25% of the min
+				       * watermark or 62.5% if __GFP_HIGH is
+				       * set.
 				       */
 #define ALLOC_MIN_RESERVE	 0x20 /* __GFP_HIGH set. Allow access to 50%
 				       * of the min watermark.
@@ -1432,7 +1433,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_KSWAPD		0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
 
 /* Flags that allow allocations below the min watermark. */
-#define ALLOC_RESERVES (ALLOC_NON_BLOCK|ALLOC_MIN_RESERVE|ALLOC_HIGHATOMIC|ALLOC_OOM)
+#define ALLOC_RESERVES (ALLOC_HARDER|ALLOC_MIN_RESERVE|ALLOC_HIGHATOMIC|ALLOC_OOM)
 
 enum ttu_flags;
 struct tlbflush_unmap_batch;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 53848312a0c21..9a07c552a1f8a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3355,7 +3355,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 			 * reserves as failing now is worse than failing a
 			 * high-order atomic allocation in the future.
 			 */
-			if (!page && (alloc_flags & (ALLOC_OOM|ALLOC_NON_BLOCK)))
+			if (!page && (alloc_flags & (ALLOC_OOM|ALLOC_HARDER)))
 				page = __rmqueue_smallest(zone, order, ft_high);
 
 			if (!page) {
@@ -3717,7 +3717,7 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 			 * or (GFP_KERNEL & ~__GFP_DIRECT_RECLAIM) do not get
 			 * access to the min reserve.
 			 */
-			if (alloc_flags & ALLOC_NON_BLOCK)
+			if (alloc_flags & ALLOC_HARDER)
 				min -= min / 4;
 		}
 
@@ -4602,7 +4602,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
 	 * The caller may dip into page reserves a bit more if the caller
 	 * cannot run direct reclaim, or if the caller has realtime scheduling
 	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-	 * set both ALLOC_NON_BLOCK and ALLOC_MIN_RESERVE(__GFP_HIGH).
+	 * set both ALLOC_HARDER and ALLOC_MIN_RESERVE(__GFP_HIGH).
 	 */
 	alloc_flags |= (__force int)
 		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
@@ -4613,7 +4613,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
 		 * if it can't schedule.
 		 */
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
-			alloc_flags |= ALLOC_NON_BLOCK;
+			alloc_flags |= ALLOC_HARDER;
 
 			if (order > 0 && (alloc_flags & ALLOC_MIN_RESERVE))
 				alloc_flags |= ALLOC_HIGHATOMIC;

-- 
2.51.2



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 18/22] mm/page_alloc: introduce ALLOC_NOBLOCK
  2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
                   ` (16 preceding siblings ...)
  2026-03-20 18:23 ` [PATCH v2 17/22] mm/page_alloc: rename ALLOC_NON_BLOCK back to _HARDER Brendan Jackman
@ 2026-03-20 18:23 ` Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 19/22] mm/page_alloc: implement __GFP_UNMAPPED allocations Brendan Jackman
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

This flag is set unless we can be sure the caller isn't in an atomic
context.

The allocator will soon start needing to call set_direct_map_* APIs
which cannot be called with IRQs off. It will need to do this even
before direct reclaim is possible.

Despite the fact that, in principle, ALLOC_NOBLOCK is distinct from
__GFP_DIRECT_RECLAIM, in order to avoid introducing a GFP flag, just
infer the former based on whether the caller set the latter. This means
that, in practice, ALLOC_NOBLOCK is just !__GFP_DIRECT_RECLAIM, except
that it is not influenced by gfp_allowed_mask. This could change later,
though.

Call it ALLOC_NOBLOCK in order to try and mitigate confusion vs the
recently-removed ALLOC_NON_BLOCK, which meant something different.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 mm/internal.h   |  1 +
 mm/page_alloc.c | 29 ++++++++++++++++++++++-------
 2 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index cc19a90a7933f..865991aca06ea 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1431,6 +1431,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_HIGHATOMIC	0x200 /* Allows access to MIGRATE_HIGHATOMIC */
 #define ALLOC_TRYLOCK		0x400 /* Only use spin_trylock in allocation path */
 #define ALLOC_KSWAPD		0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
+#define ALLOC_NOBLOCK	       0x1000 /* Caller may be atomic */
 
 /* Flags that allow allocations below the min watermark. */
 #define ALLOC_RESERVES (ALLOC_HARDER|ALLOC_MIN_RESERVE|ALLOC_HIGHATOMIC|ALLOC_OOM)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9a07c552a1f8a..83d06a6db6433 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4608,6 +4608,8 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
 		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
 
 	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
+		alloc_flags |= ALLOC_NOBLOCK;
+
 		/*
 		 * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
 		 * if it can't schedule.
@@ -4801,14 +4803,13 @@ check_retry_cpuset(int cpuset_mems_cookie, struct alloc_context *ac)
 
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
-						struct alloc_context *ac)
+		       struct alloc_context *ac, unsigned int alloc_flags)
 {
 	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
 	bool can_compact = can_direct_reclaim && gfp_compaction_allowed(gfp_mask);
 	bool nofail = gfp_mask & __GFP_NOFAIL;
 	const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
 	struct page *page = NULL;
-	unsigned int alloc_flags;
 	unsigned long did_some_progress;
 	enum compact_priority compact_priority;
 	enum compact_result compact_result;
@@ -4860,7 +4861,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * kswapd needs to be woken up, and to avoid the cost of setting up
 	 * alloc_flags precisely. So we do that now.
 	 */
-	alloc_flags = gfp_to_alloc_flags(gfp_mask, order);
+	alloc_flags |= gfp_to_alloc_flags(gfp_mask, order);
 
 	/*
 	 * We need to recalculate the starting point for the zonelist iterator
@@ -5086,6 +5087,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
+static inline unsigned int init_alloc_flags(gfp_t gfp_mask, unsigned int flags)
+{
+	/*
+	 * If the caller allowed __GFP_DIRECT_RECLAIM, they can't be atomic.
+	 * Note this is a separate determination from whether direct reclaim is
+	 * actually allowed; it must happen before applying gfp_allowed_mask.
+	 */
+	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
+		flags |= ALLOC_NOBLOCK;
+	return flags;
+}
+
 static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 		int preferred_nid, nodemask_t *nodemask,
 		struct alloc_context *ac, gfp_t *alloc_gfp,
@@ -5166,7 +5179,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 	struct list_head *pcp_list;
 	struct alloc_context ac;
 	gfp_t alloc_gfp;
-	unsigned int alloc_flags = ALLOC_WMARK_LOW;
+	unsigned int alloc_flags = init_alloc_flags(gfp, ALLOC_WMARK_LOW);
 	int nr_populated = 0, nr_account = 0;
 
 	/*
@@ -5307,7 +5320,7 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
 		int preferred_nid, nodemask_t *nodemask)
 {
 	struct page *page;
-	unsigned int alloc_flags = ALLOC_WMARK_LOW;
+	unsigned int alloc_flags = init_alloc_flags(gfp, ALLOC_WMARK_LOW);
 	gfp_t alloc_gfp; /* The gfp_t that was actually used for allocation */
 	struct alloc_context ac = { };
 
@@ -5352,7 +5365,7 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
 	 */
 	ac.nodemask = nodemask;
 
-	page = __alloc_pages_slowpath(alloc_gfp, order, &ac);
+	page = __alloc_pages_slowpath(alloc_gfp, order, &ac, alloc_flags);
 
 out:
 	if (memcg_kmem_online() && (gfp & __GFP_ACCOUNT) && page &&
@@ -7872,11 +7885,13 @@ struct page *alloc_frozen_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned
 	 */
 	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC | __GFP_COMP
 			| gfp_flags;
-	unsigned int alloc_flags = ALLOC_TRYLOCK;
+	unsigned int alloc_flags = init_alloc_flags(alloc_gfp, ALLOC_TRYLOCK);
 	struct alloc_context ac = { };
 	struct page *page;
 
 	VM_WARN_ON_ONCE(gfp_flags & ~__GFP_ACCOUNT);
+	VM_WARN_ON_ONCE(!(alloc_flags & ALLOC_NOBLOCK));
+
 	/*
 	 * In PREEMPT_RT spin_trylock() will call raw_spin_lock() which is
 	 * unsafe in NMI. If spin_trylock() is called from hard IRQ the current

-- 
2.51.2



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 19/22] mm/page_alloc: implement __GFP_UNMAPPED allocations
  2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
                   ` (17 preceding siblings ...)
  2026-03-20 18:23 ` [PATCH v2 18/22] mm/page_alloc: introduce ALLOC_NOBLOCK Brendan Jackman
@ 2026-03-20 18:23 ` Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 20/22] mm/page_alloc: implement __GFP_UNMAPPED|__GFP_ZERO allocations Brendan Jackman
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

Currently __GFP_UNMAPPED allocs will always fail because, although the
lists exist to hold them, there is no way to actually create an unmapped
page block. This commit adds one, and also the logic to map it back
again when that's needed.

Doing this at pageblock granularity ensures that the pageblock flags can
be used to infer which freetype a page belongs to. It also provides nice
batching of TLB flushes and avoids creating excessive TLB fragmentation
in the physmap.

There are some functional requirements for flipping a block:

 - Unmapping requires a TLB shootdown, meaning IRQs must be enabled.

 - Because the main usecase of this feature is to protect against CPU
   exploits, when a block is mapped it needs to be zeroed to ensure no
   residual data is available to attackers. Zeroing a block with a
   spinlock held seems undesirable.

 - Updating the pagetables might require allocating a pagetable to break
   down a huge page. This would deadlock if the zone lock was held.

This makes allocations that need to change sensitivity _somewhat_
similar to those that need to fallback to a different migratetype. But,
the locking requirements mean that this can't just be squashed into the
existing "fallback" allocator logic, instead a new allocator path just
for this purpose is needed.

The new path is assumed to be much cheaper than the really heavyweight
stuff like compaction and reclaim. But at present it is treated as less
desirable than the mobility-related "fallback" and "stealing" logic.
This might turn out to need revision (in particular, maybe it's a
problem that __rmqueue_steal(), which causes fragmentation, happens
before __rmqueue_direct_map()), but that should be treated as a subsequent
optimisation project.

This currently forbids __GFP_ZERO; this is just to keep the patch from
getting too large. The next patch will remove this restriction.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 include/linux/gfp.h |  11 +++-
 mm/Kconfig          |   4 +-
 mm/page_alloc.c     | 171 ++++++++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 170 insertions(+), 16 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 34a38c420e84a..2d8279c6300d3 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -24,6 +24,7 @@ struct mempolicy;
 static inline freetype_t gfp_freetype(const gfp_t gfp_flags)
 {
 	int migratetype;
+	unsigned int ft_flags = 0;
 
 	VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
 	BUILD_BUG_ON((1UL << GFP_MOVABLE_SHIFT) != ___GFP_MOVABLE);
@@ -40,7 +41,15 @@ static inline freetype_t gfp_freetype(const gfp_t gfp_flags)
 			>> GFP_MOVABLE_SHIFT;
 	}
 
-	return migrate_to_freetype(migratetype, 0);
+#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
+	if (gfp_flags & __GFP_UNMAPPED) {
+		if (WARN_ON_ONCE(migratetype != MIGRATE_UNMOVABLE))
+			migratetype = MIGRATE_UNMOVABLE;
+		ft_flags |= FREETYPE_UNMAPPED;
+	}
+#endif
+
+	return migrate_to_freetype(migratetype, ft_flags);
 }
 #undef GFP_MOVABLE_MASK
 #undef GFP_MOVABLE_SHIFT
diff --git a/mm/Kconfig b/mm/Kconfig
index b915af74d33cc..e4cb52149acad 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1505,8 +1505,8 @@ config MERMAP_KUNIT_TEST
 
 	  If unsure, say N.
 
-endmenu
-
 config PAGE_ALLOC_UNMAPPED
 	bool "Support allocating pages that aren't in the direct map" if COMPILE_TEST
 	default COMPILE_TEST
+
+endmenu
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 83d06a6db6433..710ee9f46d467 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -34,6 +34,7 @@
 #include <linux/folio_batch.h>
 #include <linux/memory_hotplug.h>
 #include <linux/nodemask.h>
+#include <linux/set_memory.h>
 #include <linux/vmstat.h>
 #include <linux/fault-inject.h>
 #include <linux/compaction.h>
@@ -1002,6 +1003,26 @@ static void change_pageblock_range(struct page *pageblock_page,
 	}
 }
 
+/*
+ * Can pages of these two freetypes be combined into a single higher-order free
+ * page?
+ */
+static inline bool can_merge_freetypes(freetype_t a, freetype_t b)
+{
+	if (freetypes_equal(a, b))
+		return true;
+
+	if (!migratetype_is_mergeable(free_to_migratetype(a)) ||
+	    !migratetype_is_mergeable(free_to_migratetype(b)))
+		return false;
+
+	/*
+	 * Mustn't "just" merge pages with different freetype flags; changing
+	 * those requires updating pagetables.
+	 */
+	return freetype_flags(a) == freetype_flags(b);
+}
+
 /*
  * Freeing function for a buddy system allocator.
  *
@@ -1070,9 +1091,7 @@ static inline void __free_one_page(struct page *page,
 			buddy_ft = get_pfnblock_freetype(buddy, buddy_pfn);
 			buddy_mt = free_to_migratetype(buddy_ft);
 
-			if (migratetype != buddy_mt &&
-			    (!migratetype_is_mergeable(migratetype) ||
-			     !migratetype_is_mergeable(buddy_mt)))
+			if (!can_merge_freetypes(freetype, buddy_ft))
 				goto done_merging;
 		}
 
@@ -1089,7 +1108,9 @@ static inline void __free_one_page(struct page *page,
 			/*
 			 * Match buddy type. This ensures that an
 			 * expand() down the line puts the sub-blocks
-			 * on the right freelists.
+			 * on the right freelists. Freetype flags are
+			 * already set correctly because of
+			 * can_merge_freetypes().
 			 */
 			change_pageblock_range(buddy, order, migratetype);
 		}
@@ -1982,6 +2003,9 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	struct free_area *area;
 	struct page *page;
 
+	if (freetype_idx(freetype) < 0)
+		return NULL;
+
 	/* Find a page of the appropriate size in the preferred list */
 	for (current_order = order; current_order < NR_PAGE_ORDERS; ++current_order) {
 		enum migratetype migratetype = free_to_migratetype(freetype);
@@ -3324,6 +3348,119 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
 #endif
 }
 
+#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
+/* Try to allocate a page by mapping/unmapping a block from the direct map. */
+static inline struct page *
+__rmqueue_direct_map(struct zone *zone, unsigned int request_order,
+		     unsigned int alloc_flags, freetype_t freetype)
+{
+	unsigned int ft_flags_other = freetype_flags(freetype) ^ FREETYPE_UNMAPPED;
+	freetype_t ft_other = migrate_to_freetype(free_to_migratetype(freetype),
+						  ft_flags_other);
+	bool want_mapped = !(freetype_flags(freetype) & FREETYPE_UNMAPPED);
+	enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
+	unsigned long irq_flags;
+	int nr_pageblocks;
+	struct page *page;
+	int alloc_order;
+	int err;
+
+	if (freetype_idx(ft_other) < 0)
+		return NULL;
+
+	/*
+	 * Might need a TLB shootdown. Even if IRQs are on this isn't
+	 * safe if the caller holds a lock (in case the other CPUs need that
+	 * lock to handle the shootdown IPI).
+	 */
+	if (alloc_flags & ALLOC_NOBLOCK)
+		return NULL;
+
+	if (!can_set_direct_map())
+		return NULL;
+
+	lockdep_assert(!irqs_disabled() || unlikely(early_boot_irqs_disabled));
+
+	/*
+	 * Need to [un]map a whole pageblock (otherwise it might require
+	 * allocating pagetables). First allocate it.
+	 */
+	alloc_order = max(request_order, pageblock_order);
+	nr_pageblocks = 1 << (alloc_order - pageblock_order);
+	zone_lock_irqsave(zone, irq_flags);
+	page = __rmqueue(zone, alloc_order, ft_other, alloc_flags, &rmqm);
+	zone_unlock_irqrestore(zone, irq_flags);
+	if (!page)
+		return NULL;
+
+	/*
+	 * Now that IRQs are on it's safe to do a TLB shootdown, and now that we
+	 * released the zone lock it's possible to allocate a pagetable if
+	 * needed to split up a huge page.
+	 *
+	 * Note that modifying the direct map may need to allocate pagetables.
+	 * What about unbounded recursion? Here are the assumptions that make it
+	 * safe:
+	 *
+	 * - The direct map starts out fully mapped at boot. (This is not really
+	 *   an assumption, as it's in direct control of page_alloc.c).
+	 *
+	 * - Once pages in the direct map are broken down, they are not
+	 *   re-aggregated into larger pages again.
+	 *
+	 * - Pagetables are never allocated with __GFP_UNMAPPED.
+	 *
+	 * Under these assumptions, a pagetable might need to be allocated while
+	 * _unmapping_ stuff from the direct map during a __GFP_UNMAPPED
+	 * allocation. But, the allocation of that pagetable never requires
+	 * allocating a further pagetable.
+	 */
+	err = set_direct_map_valid_noflush(page,
+				nr_pageblocks << pageblock_order, want_mapped);
+	if (err == -ENOMEM || WARN_ONCE(err, "err=%d\n", err)) {
+		zone_lock_irqsave(zone, irq_flags);
+		__free_one_page(page, page_to_pfn(page), zone,
+				alloc_order, freetype, FPI_SKIP_REPORT_NOTIFY);
+		zone_unlock_irqrestore(zone, irq_flags);
+		return NULL;
+	}
+
+	if (!want_mapped) {
+		unsigned long start = (unsigned long)page_address(page);
+		unsigned long end = start + (nr_pageblocks << (pageblock_order + PAGE_SHIFT));
+
+		flush_tlb_kernel_range(start, end);
+	}
+
+	for (int i = 0; i < nr_pageblocks; i++) {
+		struct page *block_page = page + (pageblock_nr_pages * i);
+
+		set_pageblock_freetype_flags(block_page, freetype_flags(freetype));
+	}
+
+	if (request_order >= alloc_order)
+		return page;
+
+	/* Free any remaining pages in the block. */
+	zone_lock_irqsave(zone, irq_flags);
+	for (unsigned int i = request_order; i < alloc_order; i++) {
+		struct page *page_to_free = page + (1 << i);
+
+		__free_one_page(page_to_free, page_to_pfn(page_to_free), zone,
+			i, freetype, FPI_SKIP_REPORT_NOTIFY);
+	}
+	zone_unlock_irqrestore(zone, irq_flags);
+
+	return page;
+}
+#else /* CONFIG_PAGE_ALLOC_UNMAPPED */
+static inline struct page *__rmqueue_direct_map(struct zone *zone, unsigned int request_order,
+				unsigned int alloc_flags, freetype_t freetype)
+{
+	return NULL;
+}
+#endif /* CONFIG_PAGE_ALLOC_UNMAPPED */
+
 static __always_inline
 struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 			   unsigned int order, unsigned int alloc_flags,
@@ -3331,8 +3468,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 {
 	struct page *page;
 	unsigned long flags;
-	freetype_t ft_high = freetype_with_migrate(freetype,
-						       MIGRATE_HIGHATOMIC);
+	freetype_t ft_high = freetype_with_migrate(freetype, MIGRATE_HIGHATOMIC);
 
 	do {
 		page = NULL;
@@ -3357,13 +3493,15 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 			 */
 			if (!page && (alloc_flags & (ALLOC_OOM|ALLOC_HARDER)))
 				page = __rmqueue_smallest(zone, order, ft_high);
-
-			if (!page) {
-				zone_unlock_irqrestore(zone, flags);
-				return NULL;
-			}
 		}
 		zone_unlock_irqrestore(zone, flags);
+
+		/* Try changing direct map, now we've released the zone lock */
+		if (!page)
+			page = __rmqueue_direct_map(zone, order, alloc_flags, freetype);
+		if (!page)
+			return NULL;
+
 	} while (check_new_pages(page, order));
 
 	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
@@ -3587,6 +3725,8 @@ static void reserve_highatomic_pageblock(struct page *page, int order,
 static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 						bool force)
 {
+	freetype_t ft_high = freetype_with_migrate(ac->freetype,
+					MIGRATE_HIGHATOMIC);
 	struct zonelist *zonelist = ac->zonelist;
 	unsigned long flags;
 	struct zoneref *z;
@@ -3595,6 +3735,9 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 	int order;
 	int ret;
 
+	if (freetype_idx(ft_high) < 0)
+		return false;
+
 	for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->highest_zoneidx,
 								ac->nodemask) {
 		/*
@@ -3608,8 +3751,6 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 		zone_lock_irqsave(zone, flags);
 		for (order = 0; order < NR_PAGE_ORDERS; order++) {
 			struct free_area *area = &(zone->free_area[order]);
-			freetype_t ft_high = freetype_with_migrate(ac->freetype,
-							MIGRATE_HIGHATOMIC);
 			unsigned long size;
 
 			page = get_page_from_free_area(area, ft_high);
@@ -5109,6 +5250,10 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 	ac->nodemask = nodemask;
 	ac->freetype = gfp_freetype(gfp_mask);
 
+	/* Not implemented yet. */
+	if (freetype_flags(ac->freetype) & FREETYPE_UNMAPPED && gfp_mask & __GFP_ZERO)
+		return false;
+
 	if (cpusets_enabled()) {
 		*alloc_gfp |= __GFP_HARDWALL;
 		/*

-- 
2.51.2



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 20/22] mm/page_alloc: implement __GFP_UNMAPPED|__GFP_ZERO allocations
  2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
                   ` (18 preceding siblings ...)
  2026-03-20 18:23 ` [PATCH v2 19/22] mm/page_alloc: implement __GFP_UNMAPPED allocations Brendan Jackman
@ 2026-03-20 18:23 ` Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 21/22] mm: Minimal KUnit tests for some new page_alloc logic Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 22/22] mm/secretmem: Use __GFP_UNMAPPED when available Brendan Jackman
  21 siblings, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

The pages being zeroed here are unmapped, so they can't be zeroed via
the direct map. Temporarily mapping them in the direct map is not
possible because:

- In general this requires allocating pagetables,

- Unmapping them would require a TLB shootdown, which can't be done in
  general from the allocator (x86 requires IRQs on).

Therefore, use the new mermap mechanism to zero these pages.

The main mermap API is expected to fail very often. To avoid failing
allocations when that happens, fall back to the special
mermap_get_reserved() variant, which is less efficient.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 arch/x86/include/asm/pgtable_types.h |  2 +
 mm/Kconfig                           | 11 +++++-
 mm/page_alloc.c                      | 76 +++++++++++++++++++++++++++++++-----
 3 files changed, 78 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 2ec250ba467e2..c3d73bdfff1fa 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -223,6 +223,7 @@ enum page_cache_mode {
 #define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|   0|   0|___G)
 #define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|   0|   0|___G)
 #define __PAGE_KERNEL		 (__PP|__RW|   0|___A|__NX|___D|   0|___G)
+#define __PAGE_KERNEL_NOGLOBAL	 (__PP|__RW|   0|___A|__NX|___D|   0|   0)
 #define __PAGE_KERNEL_EXEC	 (__PP|__RW|   0|___A|   0|___D|   0|___G)
 #define __PAGE_KERNEL_NOCACHE	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __NC)
 #define __PAGE_KERNEL_VVAR	 (__PP|   0|_USR|___A|__NX|   0|   0|___G)
@@ -245,6 +246,7 @@ enum page_cache_mode {
 #define __pgprot_mask(x)	__pgprot((x) & __default_kernel_pte_mask)
 
 #define PAGE_KERNEL		__pgprot_mask(__PAGE_KERNEL            | _ENC)
+#define PAGE_KERNEL_NOGLOBAL	__pgprot_mask(__PAGE_KERNEL_NOGLOBAL   | _ENC)
 #define PAGE_KERNEL_NOENC	__pgprot_mask(__PAGE_KERNEL            |    0)
 #define PAGE_KERNEL_RO		__pgprot_mask(__PAGE_KERNEL_RO         | _ENC)
 #define PAGE_KERNEL_EXEC	__pgprot_mask(__PAGE_KERNEL_EXEC       | _ENC)
diff --git a/mm/Kconfig b/mm/Kconfig
index e4cb52149acad..05b2bb841d0e0 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1506,7 +1506,14 @@ config MERMAP_KUNIT_TEST
 	  If unsure, say N.
 
 config PAGE_ALLOC_UNMAPPED
-	bool "Support allocating pages that aren't in the direct map" if COMPILE_TEST
-	default COMPILE_TEST
+	bool "Support allocating pages that aren't in the direct map"
+	depends on MERMAP
+
+config PAGE_ALLOC_KUNIT_TESTS
+	tristate "KUnit tests for the page allocator" if !KUNIT_ALL_TESTS
+	depends on KUNIT
+	default KUNIT_ALL_TESTS
+	help
+	  Builds KUnit tests for the page allocator.
 
 endmenu
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 710ee9f46d467..7c91dcbe32576 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -14,6 +14,7 @@
  *          (lots of bits borrowed from Ingo Molnar & Andrew Morton)
  */
 
+#include <linux/mermap.h>
 #include <linux/stddef.h>
 #include <linux/mm.h>
 #include <linux/highmem.h>
@@ -1327,15 +1328,72 @@ static inline bool should_skip_kasan_poison(struct page *page)
 	return page_kasan_tag(page) == KASAN_TAG_KERNEL;
 }
 
-static void kernel_init_pages(struct page *page, int numpages)
+#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
+static inline bool pageblock_unmapped(struct page *page)
 {
-	int i;
+	return freetype_flags(get_pageblock_freetype(page)) & FREETYPE_UNMAPPED;
+}
 
-	/* s390's use of memset() could override KASAN redzones. */
-	kasan_disable_current();
-	for (i = 0; i < numpages; i++)
-		clear_highpage_kasan_tagged(page + i);
-	kasan_enable_current();
+static inline void clear_page_mermap(struct page *page, unsigned int numpages)
+{
+	void *mermap;
+
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHMEM));
+
+	/* Fast path: single mapping (may fail under preemption). */
+	mermap = mermap_get(page, numpages << PAGE_SHIFT, PAGE_KERNEL_NOGLOBAL);
+	if (mermap) {
+		void *buf = kasan_reset_tag(mermap_addr(mermap));
+
+		for (int i = 0; i < numpages; i++)
+			clear_page(buf + (i << PAGE_SHIFT));
+		mermap_put(mermap);
+		return;
+	}
+
+	/* Slow path, map each page individually (always succeeds). */
+	for (int i = 0; i < numpages; i++) {
+		unsigned long flags;
+
+		local_irq_save(flags);
+		mermap = mermap_get_reserved(page + i, PAGE_KERNEL_NOGLOBAL);
+		clear_page(kasan_reset_tag(mermap_addr(mermap)));
+		mermap_put(mermap);
+		local_irq_restore(flags);
+	}
+}
+#else
+static inline bool pageblock_unmapped(struct page *page)
+{
+	return false;
+}
+
+static inline void clear_page_mermap(struct page *page, unsigned int numpages)
+{
+	BUG();
+}
+#endif
+
+static void kernel_init_pages(struct page *page, unsigned int numpages)
+{
+	int num_blocks = DIV_ROUND_UP(numpages, pageblock_nr_pages);
+
+	for (int block = 0; block < num_blocks; block++) {
+		struct page *block_page = page + (block << pageblock_order);
+		bool unmapped = pageblock_unmapped(block_page);
+
+		/* s390's use of memset() could override KASAN redzones. */
+		kasan_disable_current();
+		if (unmapped) {
+			clear_page_mermap(block_page, min(numpages, pageblock_nr_pages));
+		} else {
+			for (int i = 0; i < min(numpages, pageblock_nr_pages); i++)
+				clear_highpage_kasan_tagged(block_page + i);
+		}
+		kasan_enable_current();
+
+		numpages -= pageblock_nr_pages;
+	}
 }
 
 #ifdef CONFIG_MEM_ALLOC_PROFILING
@@ -5250,8 +5308,8 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 	ac->nodemask = nodemask;
 	ac->freetype = gfp_freetype(gfp_mask);
 
-	/* Not implemented yet. */
-	if (freetype_flags(ac->freetype) & FREETYPE_UNMAPPED && gfp_mask & __GFP_ZERO)
+	if (freetype_flags(ac->freetype) & FREETYPE_UNMAPPED &&
+	    WARN_ON(!mermap_ready()))
 		return false;
 
 	if (cpusets_enabled()) {

-- 
2.51.2




* [PATCH v2 21/22] mm: Minimal KUnit tests for some new page_alloc logic
  2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
                   ` (19 preceding siblings ...)
  2026-03-20 18:23 ` [PATCH v2 20/22] mm/page_alloc: implement __GFP_UNMAPPED|__GFP_ZERO allocations Brendan Jackman
@ 2026-03-20 18:23 ` Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 22/22] mm/secretmem: Use __GFP_UNMAPPED when available Brendan Jackman
  21 siblings, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

Add a simple smoke test for __GFP_UNMAPPED that tries to exercise
flipping pageblocks between mapped/unmapped state.

Also add some basic tests for some freelist-indexing helpers.

Simplest way to run these on x86:

tools/testing/kunit/kunit.py run --arch=x86_64 "page_alloc.*" \
	--kconfig_add CONFIG_MERMAP=y --kconfig_add CONFIG_PAGE_ALLOC_UNMAPPED=y

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 kernel/panic.c              |   2 +
 mm/Kconfig                  |   2 +-
 mm/Makefile                 |   1 +
 mm/init-mm.c                |   3 +
 mm/internal.h               |   6 ++
 mm/page_alloc.c             |  11 +-
 mm/tests/page_alloc_kunit.c | 250 ++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 271 insertions(+), 4 deletions(-)

diff --git a/kernel/panic.c b/kernel/panic.c
index 20feada5319d4..1301444a0447a 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -39,6 +39,7 @@
 #include <linux/sys_info.h>
 #include <trace/events/error_report.h>
 #include <asm/sections.h>
+#include <kunit/visibility.h>
 
 #define PANIC_TIMER_STEP 100
 #define PANIC_BLINK_SPD 18
@@ -941,6 +942,7 @@ unsigned long get_taint(void)
 {
 	return tainted_mask;
 }
+EXPORT_SYMBOL_IF_KUNIT(get_taint);
 
 /**
  * add_taint: add a taint flag if not already set.
diff --git a/mm/Kconfig b/mm/Kconfig
index 05b2bb841d0e0..2021f52a0c422 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1509,7 +1509,7 @@ config PAGE_ALLOC_UNMAPPED
 	bool "Support allocating pages that aren't in the direct map"
 	depends on MERMAP
 
-config PAGE_ALLOC_KUNIT_TESTS
+config PAGE_ALLOC_KUNIT_TEST
 	tristate "KUnit tests for the page allocator" if !KUNIT_ALL_TESTS
 	depends on KUNIT
 	default KUNIT_ALL_TESTS
diff --git a/mm/Makefile b/mm/Makefile
index 93a1756303cf9..11849162b6f5a 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -152,3 +152,4 @@ obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
 obj-$(CONFIG_LAZY_MMU_MODE_KUNIT_TEST) += tests/lazy_mmu_mode_kunit.o
 obj-$(CONFIG_MERMAP) += mermap.o
 obj-$(CONFIG_MERMAP_KUNIT_TEST) += tests/mermap_kunit.o
+obj-$(CONFIG_PAGE_ALLOC_KUNIT_TEST) += tests/page_alloc_kunit.o
diff --git a/mm/init-mm.c b/mm/init-mm.c
index c5556bb9d5f01..31103356da654 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -13,6 +13,8 @@
 #include <linux/iommu.h>
 #include <asm/mmu.h>
 
+#include <kunit/visibility.h>
+
 #ifndef INIT_MM_CONTEXT
 #define INIT_MM_CONTEXT(name)
 #endif
@@ -50,6 +52,7 @@ struct mm_struct init_mm = {
 	.flexible_array	= MM_STRUCT_FLEXIBLE_ARRAY_INIT,
 	INIT_MM_CONTEXT(init_mm)
 };
+EXPORT_SYMBOL_IF_KUNIT(init_mm);
 
 void setup_initial_init_mm(void *start_code, void *end_code,
 			   void *end_data, void *brk)
diff --git a/mm/internal.h b/mm/internal.h
index 865991aca06ea..6c652148bc906 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1908,4 +1908,10 @@ int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 			  unsigned long size, pte_fn_t fn,
 			  void *data, unsigned int flags);
 
+#if IS_ENABLED(CONFIG_KUNIT)
+unsigned int order_to_pindex(freetype_t freetype, int order);
+int pindex_to_order(unsigned int pindex);
+bool pcp_allowed_order(unsigned int order);
+#endif /* IS_ENABLED(CONFIG_KUNIT) */
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7c91dcbe32576..291ba32f1f1ad 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -58,6 +58,7 @@
 #include <linux/pgalloc_tag.h>
 #include <linux/mmzone_lock.h>
 #include <asm/div64.h>
+#include <kunit/visibility.h>
 #include "internal.h"
 #include "shuffle.h"
 #include "page_reporting.h"
@@ -461,6 +462,7 @@ get_pfnblock_freetype(const struct page *page, unsigned long pfn)
 {
 	return __get_pfnblock_freetype(page, pfn, 0);
 }
+EXPORT_SYMBOL_IF_KUNIT(get_pfnblock_freetype);
 
 
 /**
@@ -696,7 +698,7 @@ static void bad_page(struct page *page, const char *reason)
 	add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
 }
 
-static inline unsigned int order_to_pindex(freetype_t freetype, int order)
+VISIBLE_IF_KUNIT inline unsigned int order_to_pindex(freetype_t freetype, int order)
 {
 	int migratetype = free_to_migratetype(freetype);
 
@@ -724,8 +726,9 @@ static inline unsigned int order_to_pindex(freetype_t freetype, int order)
 
 	return (MIGRATE_PCPTYPES * order) + migratetype;
 }
+EXPORT_SYMBOL_IF_KUNIT(order_to_pindex);
 
-static inline int pindex_to_order(unsigned int pindex)
+VISIBLE_IF_KUNIT int pindex_to_order(unsigned int pindex)
 {
 	unsigned int unmapped_base = NR_LOWORDER_PCP_LISTS + NR_PCP_THP;
 	int order;
@@ -748,8 +751,9 @@ static inline int pindex_to_order(unsigned int pindex)
 
 	return order;
 }
+EXPORT_SYMBOL_IF_KUNIT(pindex_to_order);
 
-static inline bool pcp_allowed_order(unsigned int order)
+VISIBLE_IF_KUNIT inline bool pcp_allowed_order(unsigned int order)
 {
 	if (order <= PAGE_ALLOC_COSTLY_ORDER)
 		return true;
@@ -759,6 +763,7 @@ static inline bool pcp_allowed_order(unsigned int order)
 #endif
 	return false;
 }
+EXPORT_SYMBOL_IF_KUNIT(pcp_allowed_order);
 
 /*
  * Higher-order pages are called "compound pages".  They are structured thusly:
diff --git a/mm/tests/page_alloc_kunit.c b/mm/tests/page_alloc_kunit.c
new file mode 100644
index 0000000000000..bd55d0bc35ac9
--- /dev/null
+++ b/mm/tests/page_alloc_kunit.c
@@ -0,0 +1,250 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/gfp.h>
+#include <linux/kernel.h>
+#include <linux/mermap.h>
+#include <linux/mm_types.h>
+#include <linux/mm.h>
+#include <linux/pgtable.h>
+#include <linux/set_memory.h>
+#include <linux/sched/mm.h>
+#include <linux/types.h>
+#include <linux/vmalloc.h>
+
+#include <kunit/resource.h>
+#include <kunit/test.h>
+
+#include "internal.h"
+
+struct free_pages_ctx {
+	unsigned int order;
+	struct list_head pages;
+};
+
+static inline void action_many__free_pages(void *context)
+{
+	struct free_pages_ctx *ctx = context;
+	struct page *page, *tmp;
+
+	list_for_each_entry_safe(page, tmp, &ctx->pages, lru)
+		__free_pages(page, ctx->order);
+}
+
+/*
+ * Allocate a bunch of pages with the same order and GFP flags, transparently
+ * take care of error handling and cleanup. Does this all via a single KUnit
+ * resource, i.e. has a fixed memory overhead.
+ */
+static inline struct free_pages_ctx *
+do_many_alloc_pages(struct kunit *test, gfp_t gfp,
+		    unsigned int order, unsigned int count)
+{
+	struct free_pages_ctx *ctx = kunit_kzalloc(
+		test, sizeof(struct free_pages_ctx), GFP_KERNEL);
+
+	KUNIT_ASSERT_NOT_NULL(test, ctx);
+	INIT_LIST_HEAD(&ctx->pages);
+	ctx->order = order;
+
+	for (int i = 0; i < count; i++) {
+		struct page *page = alloc_pages(gfp, order);
+
+		if (!page) {
+			struct page *page, *tmp;
+
+			list_for_each_entry_safe(page, tmp, &ctx->pages, lru)
+				__free_pages(page, order);
+
+			KUNIT_FAIL_AND_ABORT(test,
+				"Failed to alloc order %d page (GFP *%pG) iter %d",
+				order, &gfp, i);
+		}
+		list_add(&page->lru, &ctx->pages);
+	}
+
+	KUNIT_ASSERT_EQ(test,
+		kunit_add_action_or_reset(test, action_many__free_pages, ctx), 0);
+	return ctx;
+}
+
+#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
+
+static const gfp_t gfp_params_array[] = {
+	0,
+	__GFP_ZERO,
+};
+
+static void gfp_param_get_desc(const gfp_t *gfp, char *desc)
+{
+	snprintf(desc, KUNIT_PARAM_DESC_SIZE, "%pGg", gfp);
+}
+
+KUNIT_ARRAY_PARAM(gfp, gfp_params_array, gfp_param_get_desc);
+
+/* Do some allocations that force the allocator to map/unmap some blocks.  */
+static void test_alloc_map_unmap(struct kunit *test)
+{
+	unsigned long page_majority;
+	struct free_pages_ctx *ctx;
+	const gfp_t *gfp_extra = test->param_value;
+	gfp_t gfp = GFP_KERNEL | __GFP_THISNODE | __GFP_UNMAPPED | *gfp_extra;
+	struct page *page;
+
+	kunit_attach_mm();
+	mermap_mm_prepare(current->mm);
+
+	/* No cleanup here - assuming kthread "belongs" to this test. */
+	set_cpus_allowed_ptr(current, cpumask_of_node(numa_node_id()));
+
+	/*
+	 * First allocate more than half of the memory in the node as
+	 * unmapped. Assuming the memory starts out mapped, this should
+	 * exercise the unmap.
+	 */
+	page_majority = (node_present_pages(numa_node_id()) / 2) + 1;
+	ctx = do_many_alloc_pages(test, gfp, 0, page_majority);
+
+	/* Check pages are unmapped */
+	list_for_each_entry(page, &ctx->pages, lru) {
+		freetype_t ft = get_pfnblock_freetype(page, page_to_pfn(page));
+
+		/*
+		 * Logically it should be an EXPECT, but that would
+		 * cause heavy log spam on failure so use ASSERT for
+		 * concision.
+		 */
+		KUNIT_ASSERT_FALSE(test, kernel_page_present(page));
+		KUNIT_ASSERT_TRUE(test, freetype_flags(ft) & FREETYPE_UNMAPPED);
+	}
+
+	/*
+	 * Now free them again and allocate the same amount without
+	 * __GFP_UNMAPPED. This will exercise the mapping logic.
+	 */
+	kunit_release_action(test, action_many__free_pages, ctx);
+	gfp &= ~__GFP_UNMAPPED;
+	ctx = do_many_alloc_pages(test, gfp, 0, page_majority);
+
+	/* Check pages are mapped. */
+	list_for_each_entry(page, &ctx->pages, lru)
+		KUNIT_ASSERT_TRUE(test, kernel_page_present(page));
+}
+
+#endif /* CONFIG_PAGE_ALLOC_UNMAPPED */
+
+static void __test_pindex_helpers(struct kunit *test, unsigned long *bitmap,
+				  int mt, unsigned int ftflags, unsigned int order)
+{
+	freetype_t ft = migrate_to_freetype(mt, ftflags);
+	unsigned int pindex;
+	int got_order;
+
+	if (!pcp_allowed_order(order))
+		return;
+
+	if (mt >= MIGRATE_PCPTYPES)
+		return;
+
+	if (freetype_idx(ft) < 0)
+		return;
+
+	pindex = order_to_pindex(ft, order);
+
+	KUNIT_ASSERT_LT_MSG(test, pindex, NR_PCP_LISTS,
+		"invalid pindex %d (order %d mt %d flags %#x)",
+		pindex, order, mt, ftflags);
+	KUNIT_EXPECT_TRUE_MSG(test, test_bit(pindex, bitmap),
+		"pindex %d reused (order %d mt %d flags %#x)",
+		pindex, order, mt, ftflags);
+
+	/*
+	 * For THP, two migratetypes map to the same pindex,
+	 * just manually exclude one of those cases.
+	 */
+	if (!(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+		order == HPAGE_PMD_ORDER &&
+		mt == min(MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE)))
+		clear_bit(pindex, bitmap);
+
+	got_order = pindex_to_order(pindex);
+	KUNIT_EXPECT_EQ_MSG(test, order, got_order,
+		"roundtrip failed, got %d want %d (pindex %d mt %d flags %#x)",
+		got_order, order, pindex, mt, ftflags);
+}
+
+/* This just checks for basic arithmetic errors. */
+static void test_pindex_helpers(struct kunit *test)
+{
+	unsigned long bitmap[bitmap_size(NR_PCP_LISTS)];
+
+	/* Bit means "pindex not yet used". */
+	bitmap_fill(bitmap, NR_PCP_LISTS);
+
+	for (unsigned int order = 0; order < NR_PAGE_ORDERS; order++) {
+		for (int mt = 0; mt < MIGRATE_TYPES; mt++) {
+			__test_pindex_helpers(test, bitmap, mt, 0, order);
+			if (FREETYPE_UNMAPPED)
+				__test_pindex_helpers(test, bitmap, mt,
+						      FREETYPE_UNMAPPED, order);
+		}
+	}
+
+	KUNIT_EXPECT_TRUE_MSG(test, bitmap_empty(bitmap, NR_PCP_LISTS),
+		"unused pindices: %*pbl", NR_PCP_LISTS, bitmap);
+}
+
+static void __test_freetype_idx(struct kunit *test, unsigned int order,
+				int migratetype, unsigned int ftflags,
+				unsigned long *bitmap)
+{
+	freetype_t ft = migrate_to_freetype(migratetype, ftflags);
+	int idx = freetype_idx(ft);
+
+	if (idx == -1)
+		return;
+	KUNIT_ASSERT_GE(test, idx, 0);
+	KUNIT_ASSERT_LT(test, idx, NR_FREETYPE_IDXS);
+
+	KUNIT_EXPECT_LT_MSG(test, idx, NR_PCP_LISTS,
+		"invalid idx %d (order %d mt %d flags %#x)",
+		idx, order, migratetype, ftflags);
+	clear_bit(idx, bitmap);
+}
+
+static void test_freetype_idx(struct kunit *test)
+{
+	unsigned long bitmap[bitmap_size(NR_FREETYPE_IDXS)];
+
+	/* Bit means "freetype idx not yet used". */
+	bitmap_fill(bitmap, NR_FREETYPE_IDXS);
+
+	for (unsigned int order = 0; order < NR_PAGE_ORDERS; order++) {
+		for (int mt = 0; mt < MIGRATE_TYPES; mt++) {
+			__test_freetype_idx(test, order, mt, 0, bitmap);
+			if (FREETYPE_UNMAPPED)
+				__test_freetype_idx(test, order, mt,
+						    FREETYPE_UNMAPPED, bitmap);
+		}
+	}
+
+	KUNIT_EXPECT_TRUE_MSG(test, bitmap_empty(bitmap, NR_FREETYPE_IDXS),
+		"unused idxs: %*pbl", NR_FREETYPE_IDXS, bitmap);
+}
+
+static struct kunit_case test_cases[] = {
+#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
+	KUNIT_CASE_PARAM(test_alloc_map_unmap, gfp_gen_params),
+#endif
+	KUNIT_CASE(test_pindex_helpers),
+	KUNIT_CASE(test_freetype_idx),
+	{}
+};
+
+static struct kunit_suite test_suite = {
+	.name = "page_alloc",
+	.test_cases = test_cases,
+};
+
+kunit_test_suite(test_suite);
+
+MODULE_LICENSE("GPL");
+MODULE_IMPORT_NS("EXPORTED_FOR_KUNIT_TESTING");

-- 
2.51.2




* [PATCH v2 22/22] mm/secretmem: Use __GFP_UNMAPPED when available
  2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
                   ` (20 preceding siblings ...)
  2026-03-20 18:23 ` [PATCH v2 21/22] mm: Minimal KUnit tests for some new page_alloc logic Brendan Jackman
@ 2026-03-20 18:23 ` Brendan Jackman
  21 siblings, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

This is the simplest possible way to adopt __GFP_UNMAPPED. Use it to
allocate pages when it's available, meaning the
set_direct_map_invalid_noflush() call is no longer needed.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 mm/secretmem.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 74 insertions(+), 13 deletions(-)

diff --git a/mm/secretmem.c b/mm/secretmem.c
index 5f57ac4720d32..9fef91237358a 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -6,6 +6,7 @@
  */
 
 #include <linux/mm.h>
+#include <linux/mermap.h>
 #include <linux/fs.h>
 #include <linux/swap.h>
 #include <linux/mount.h>
@@ -47,13 +48,78 @@ bool secretmem_active(void)
 	return !!atomic_read(&secretmem_users);
 }
 
+/*
+ * If it's supported, allocate using __GFP_UNMAPPED. This lets the page
+ * allocator amortize TLB flushes and avoids direct map fragmentation.
+ */
+#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
+static inline struct folio *secretmem_folio_alloc(gfp_t gfp, unsigned int order)
+{
+	int err;
+
+	/* Required for __GFP_UNMAPPED|__GFP_ZERO. */
+	err = mermap_mm_prepare(current->mm);
+	if (err)
+		return ERR_PTR(err);
+
+	return folio_alloc(gfp | __GFP_UNMAPPED, order);
+}
+
+static inline void secretmem_vma_close(struct vm_area_struct *area)
+{
+	/*
+	 * Because the folio was allocated with __GFP_UNMAPPED|__GFP_ZERO, a TLB
+	 * shootdown is required for the mermap in order to prevent CPU attacks
+	 * from leaking the content. This is the simplest possible way to
+	 * achieve that, but obviously it's inefficient - it should really be
+	 * amortized against the normal flushing that happened during the VMA
+	 * teardown.
+	 */
+	flush_tlb_mm(area->vm_mm);
+}
+
+/* Used __GFP_UNMAPPED so no need to restore direct map or flush TLB. */
+static inline void secretmem_folio_restore(struct folio *folio) { }
+static inline void secretmem_folio_flush(struct folio *folio) { }
+
+#else
+static inline struct folio *secretmem_folio_alloc(gfp_t gfp, unsigned int order)
+{
+	struct folio *folio;
+	int err;
+
+	folio = folio_alloc(gfp, order);
+	if (!folio)
+		return NULL;
+
+	err = set_direct_map_invalid_noflush(folio_page(folio, 0));
+	if (err) {
+		folio_put(folio);
+		return ERR_PTR(err);
+	}
+
+	return folio;
+}
+
+static inline void secretmem_folio_restore(struct folio *folio)
+{
+	set_direct_map_default_noflush(folio_page(folio, 0));
+}
+
+static inline void secretmem_folio_flush(struct folio *folio)
+{
+	unsigned long addr = (unsigned long)folio_address(folio);
+
+	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+}
+#endif
+
 static vm_fault_t secretmem_fault(struct vm_fault *vmf)
 {
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	pgoff_t offset = vmf->pgoff;
 	gfp_t gfp = vmf->gfp_mask;
-	unsigned long addr;
 	struct folio *folio;
 	vm_fault_t ret;
 	int err;
@@ -66,16 +132,9 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
 retry:
 	folio = filemap_lock_folio(mapping, offset);
 	if (IS_ERR(folio)) {
-		folio = folio_alloc(gfp | __GFP_ZERO, 0);
-		if (!folio) {
-			ret = VM_FAULT_OOM;
-			goto out;
-		}
-
-		err = set_direct_map_invalid_noflush(folio_page(folio, 0));
-		if (err) {
-			folio_put(folio);
-			ret = vmf_error(err);
+		folio = secretmem_folio_alloc(gfp | __GFP_ZERO, 0);
+		if (IS_ERR_OR_NULL(folio)) {
+			ret = folio ? vmf_error(PTR_ERR(folio)) : VM_FAULT_OOM;
 			goto out;
 		}
 
@@ -96,8 +155,7 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
 			goto out;
 		}
 
-		addr = (unsigned long)folio_address(folio);
-		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+		secretmem_folio_flush(folio);
 	}
 
 	vmf->page = folio_file_page(folio, vmf->pgoff);
@@ -110,6 +168,9 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
 
 static const struct vm_operations_struct secretmem_vm_ops = {
 	.fault = secretmem_fault,
+#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
+	.close = secretmem_vma_close,
+#endif
 };
 
 static int secretmem_release(struct inode *inode, struct file *file)

-- 
2.51.2




* Re: [PATCH v2 01/22] x86/mm: split out preallocate_sub_pgd()
  2026-03-20 18:23 ` [PATCH v2 01/22] x86/mm: split out preallocate_sub_pgd() Brendan Jackman
@ 2026-03-20 19:42   ` Dave Hansen
  2026-03-23 11:01     ` Brendan Jackman
  2026-03-24 15:27   ` Borislav Petkov
  1 sibling, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2026-03-20 19:42 UTC (permalink / raw)
  To: Brendan Jackman, Borislav Petkov, Dave Hansen, Peter Zijlstra,
	Andrew Morton, David Hildenbrand, Vlastimil Babka, Wei Xu,
	Johannes Weiner, Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Yosry Ahmed

On 3/20/26 11:23, Brendan Jackman wrote:
> -		/*
> -		 * The goal here is to allocate all possibly required
> -		 * hardware page tables pointed to by the top hardware
> -		 * level.

This comment is pretty important, IMNHO, and you zapped it.

The problem here is that the per-MM carved out space is PGD-sized. You
want to make sure there are page tables allocated for that space. But,
if you say "go allocate a p4d" then that will collapse down to doing
nothing on a 4-level system.

So, this is effectively:

	Go allocate a p4d or pud, depending on if it's 4 or 5 level.
	Basically, always allocate the level that the hardware PGD
	points to.

Could we put a comment to that effect around somewhere, please?



* Re: [PATCH v2 02/22] x86/mm: Generalize LDT remap into "mm-local region"
  2026-03-20 18:23 ` [PATCH v2 02/22] x86/mm: Generalize LDT remap into "mm-local region" Brendan Jackman
@ 2026-03-20 19:47   ` Dave Hansen
  2026-03-23 12:01     ` Brendan Jackman
  2026-03-25 14:23   ` Brendan Jackman
  1 sibling, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2026-03-20 19:47 UTC (permalink / raw)
  To: Brendan Jackman, Borislav Petkov, Dave Hansen, Peter Zijlstra,
	Andrew Morton, David Hildenbrand, Vlastimil Babka, Wei Xu,
	Johannes Weiner, Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Yosry Ahmed

On 3/20/26 11:23, Brendan Jackman wrote:
>  
> +#ifdef CONFIG_MM_LOCAL_REGION
> +static inline void mm_local_region_free(struct mm_struct *mm)
> +{
> +	if (mm_local_region_used(mm)) {
> +		struct mmu_gather tlb;
> +		unsigned long start = MM_LOCAL_BASE_ADDR;
> +		unsigned long end = MM_LOCAL_END_ADDR;
> +
> +		/*
> +		 * Although free_pgd_range() is intended for freeing user
> +		 * page-tables, it also works out for kernel mappings on x86.
> +		 * We use tlb_gather_mmu_fullmm() to avoid confusing the
> +		 * range-tracking logic in __tlb_adjust_range().
> +		 */
> +		tlb_gather_mmu_fullmm(&tlb, mm);

These are superficial nits and I need to go through this series in
actual detail, but here are the nits:

Indentation is bad. What you have here double-indents the whole function.

Do this:

	struct mmu_gather tlb;
	unsigned long start = MM_LOCAL_BASE_ADDR;
	unsigned long end = MM_LOCAL_END_ADDR;

	if (!mm_local_region_used(mm))
		return;

	... rest of code here

> +		 * We use tlb_gather_mmu_fullmm() to avoid confusing the
> +		 * range-tracking logic in __tlb_adjust_range().
> +		 */

Imperative voice, please.

And a meta-comment:

>  Documentation/arch/x86/x86_64/mm.rst    |   4 +-
>  arch/x86/Kconfig                        |   2 +
>  arch/x86/include/asm/mmu_context.h      | 119 ++++++++++++++++++++++++++++-
>  arch/x86/include/asm/page.h             |  32 ++++++++
>  arch/x86/include/asm/pgtable_32_areas.h |   9 ++-
>  arch/x86/include/asm/pgtable_64_types.h |  12 ++-
>  arch/x86/kernel/ldt.c                   | 130 +++++---------------------------

This is too big and there's too much going on here. This is doing a few
logical things like both introducing mm-local regions *and* making the
LDT remap one of them.



* Re: [PATCH v2 01/22] x86/mm: split out preallocate_sub_pgd()
  2026-03-20 19:42   ` Dave Hansen
@ 2026-03-23 11:01     ` Brendan Jackman
  0 siblings, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-23 11:01 UTC (permalink / raw)
  To: Dave Hansen, Brendan Jackman, Borislav Petkov, Dave Hansen,
	Peter Zijlstra, Andrew Morton, David Hildenbrand, Vlastimil Babka,
	Wei Xu, Johannes Weiner, Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Yosry Ahmed

On Fri Mar 20, 2026 at 7:42 PM UTC, Dave Hansen wrote:
> On 3/20/26 11:23, Brendan Jackman wrote:
>> -		/*
>> -		 * The goal here is to allocate all possibly required
>> -		 * hardware page tables pointed to by the top hardware
>> -		 * level.
>
> This comment is pretty important, IMNHO, and you zapped it.
>
> The problem here is that the per-MM carved out space is PGD-sized. You
> want to make sure there are page tables allocated for that space. But,
> if you say "go allocate a p4d" then that will collapse down to doing
> nothing on a 4-level system.
>
> So, this is effectively:
>
> 	Go allocate a p4d or pud, depending on if it's 4 or 5 level.
> 	Basically, always allocate the level that the hardware PGD
> 	points to.
>
> Could we put a comment to that effect around somewhere, please?

Hm I kinda thought the comments I left in there captured all this stuff,
but yeah I can see this is a bit of a weird function so more commentary
makes sense. How about I just put a few more words into the top comment:

/*
 * Allocate all possibly required hardware page tables pointed to by the
 * top hardware level. In other words, allocate a p4d on 5-level or a
 * pud on 4-level.
 */

And then just leave the internal one as it is:

	/*
	 * On 4-level systems, the P4D layer is folded away and
	 * the above code does no preallocation.  Below, go down
	 * to the pud _software_ level to ensure the second
	 * hardware level is allocated on 4-level systems too.
	 */


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 02/22] x86/mm: Generalize LDT remap into "mm-local region"
  2026-03-20 19:47   ` Dave Hansen
@ 2026-03-23 12:01     ` Brendan Jackman
  2026-03-23 12:57       ` Brendan Jackman
  0 siblings, 1 reply; 32+ messages in thread
From: Brendan Jackman @ 2026-03-23 12:01 UTC (permalink / raw)
  To: Dave Hansen, Brendan Jackman, Borislav Petkov, Dave Hansen,
	Peter Zijlstra, Andrew Morton, David Hildenbrand, Vlastimil Babka,
	Wei Xu, Johannes Weiner, Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Yosry Ahmed

On Fri Mar 20, 2026 at 7:47 PM UTC, Dave Hansen wrote:
> On 3/20/26 11:23, Brendan Jackman wrote:
>>  
>> +#ifdef CONFIG_MM_LOCAL_REGION
>> +static inline void mm_local_region_free(struct mm_struct *mm)
>> +{
>> +	if (mm_local_region_used(mm)) {
>> +		struct mmu_gather tlb;
>> +		unsigned long start = MM_LOCAL_BASE_ADDR;
>> +		unsigned long end = MM_LOCAL_END_ADDR;
>> +
>> +		/*
>> +		 * Although free_pgd_range() is intended for freeing user
>> +		 * page-tables, it also works out for kernel mappings on x86.
>> +		 * We use tlb_gather_mmu_fullmm() to avoid confusing the
>> +		 * range-tracking logic in __tlb_adjust_range().
>> +		 */
>> +		tlb_gather_mmu_fullmm(&tlb, mm);
>
> These are superficial nits and I need to go through this series in
> actual detail, 

Thanks, this series is pretty brutal so I'm very happy to receive
incremental reviews!

> but here are the nits:
>
> Indentation is bad. What you have here double-indents the whole function.
>
> Do this:
>
> 	struct mmu_gather tlb;
> 	unsigned long start = MM_LOCAL_BASE_ADDR;
> 	unsigned long end = MM_LOCAL_END_ADDR;
>
> 	if (!mm_local_region_used(mm))
> 		return;
>
> 	... rest of code here

Ack

>
>> +		 * We use tlb_gather_mmu_fullmm() to avoid confusing the
>> +		 * range-tracking logic in __tlb_adjust_range().
>> +		 */
>
> Imperative voice, please.

Yeah I don't think I'm ever gonna stop making this mistake. Any LLM
should be able to catch this for me, I think it's time to find a way to
get that into my pre-mail workflow.

> And a meta-comment:
>
>>  Documentation/arch/x86/x86_64/mm.rst    |   4 +-
>>  arch/x86/Kconfig                        |   2 +
>>  arch/x86/include/asm/mmu_context.h      | 119 ++++++++++++++++++++++++++++-
>>  arch/x86/include/asm/page.h             |  32 ++++++++
>>  arch/x86/include/asm/pgtable_32_areas.h |   9 ++-
>>  arch/x86/include/asm/pgtable_64_types.h |  12 ++-
>>  arch/x86/kernel/ldt.c                   | 130 +++++---------------------------
>
> This is too big and there's too much going on here. This is doing a few
> logical things like both introducing mm-local regions *and* making the
> LDT remap one of them.

IIRC I tried this, but having both an mm-local region and a separate
LDT remap at the same time is annoying: it would mean reviewing
temporary code that deals with both existing simultaneously.

However I just had an idea: I will try creating the mm-local region but
with size 0. Then in a separate patch I'll expand it and simultaneously
move the LDT remap into it.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 02/22] x86/mm: Generalize LDT remap into "mm-local region"
  2026-03-23 12:01     ` Brendan Jackman
@ 2026-03-23 12:57       ` Brendan Jackman
  0 siblings, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-23 12:57 UTC (permalink / raw)
  To: Brendan Jackman, Dave Hansen, Borislav Petkov, Dave Hansen
  Cc: linux-mm, linux-kernel, x86, kfree, clm


TANGENT - off topic, removing most people from CC.

On Mon Mar 23, 2026 at 12:01 PM UTC, Brendan Jackman wrote:
> On Fri Mar 20, 2026 at 7:47 PM UTC, Dave Hansen wrote:

>>> +		 * We use tlb_gather_mmu_fullmm() to avoid confusing the
>>> +		 * range-tracking logic in __tlb_adjust_range().
>>> +		 */
>>
>> Imperative voice, please.
>
> Yeah I don't think I'm ever gonna stop making this mistake. Any LLM
> should be able to catch this for me, I think it's time to find a way to
> get that into my pre-mail workflow.

Just dumping what I learned from briefly looking into this:

It looks like Sashiko [0] and Chris Mason's review-prompts are not
really geared up to deal with trivialities like this right now. They
are still evolving fast and focussing on much more advanced topics,
and AFAICS they don't have a standardised way for the agent to "shell
out" to a cheap model for simple checks like this. So for now, until
that stuff crystallises a bit more, I'll just use a dumb standalone
script.

I wrote a quick prompt to check for these particular rules and found
that the "pro" model worked perfectly but took ages (and probably an
obscene amount of energy) while gemini-2.5-flash-lite was instant but
very unreliable. Then I asked the pro model to rework the prompt for the
benefit of the small model. Its version made the small model work
reliably.

I'll paste the prompt below. The command to run it using Google's stuff
is:

gemini --prompt "$(cat check_patch.md) $(git show)" --model "gemini-3.1-flash-lite-preview"

I assume open models that fit on a laptop can handle this task too but I
haven't tried it as Google's tooling seems to be hardcoded to funnel
you to the cloud service. Yuck, something to figure out on the weekend I
suppose.

(Alternatively I bet a plain old NLTK script can handle these particular
rules. But that will run into limitations quickly, while dumb LLMs are
generic).
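
To illustrate, the pronoun rule at least is trivially regex-able.
Something like the sketch below (only the pronoun half; the
imperative-mood check is exactly where a dumb script falls over and
the LLM earns its keep):

```python
import re

# Added comment lines in a unified diff: '+' then optional whitespace
# then a comment marker (//, /*, *, or #).
COMMENT_ADD = re.compile(r"^\+\s*(//|/\*|\*|#)(.*)")
# Rule 1 from the prompt: no personal pronouns.
PRONOUNS = re.compile(r"\b(I|we|you|our|us|my|your)\b", re.IGNORECASE)

def check_patch(patch_text):
    """Return (lineno, line, reason) tuples for suspect added comments."""
    problems = []
    for lineno, line in enumerate(patch_text.splitlines(), 1):
        m = COMMENT_ADD.match(line)
        if not m:
            continue
        if PRONOUNS.search(m.group(2)):
            problems.append((lineno, line, "personal pronoun"))
    return problems
```

Run it over `git show` output and eyeball the hits; expect false
positives on things like a loop variable "i" mentioned in a comment.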

[0] https://lwn.net/ml/all/87jyv7a1q5.fsf@linux.dev/

---

You are a strict code reviewer. You will be given a patch file, formatted email, or Git diff.
Your only task is to review the English style of newly added code comments (lines starting with '+' that are comments, e.g., '+ //', '+ /*', '+ *', or '+ #'). Ignore all actual code, variable names, and removed lines.

Flag a comment if it violates either of these two rules:

1. Avoid personal pronouns. For example: Do not use: I, we, you, our, us, my, your. Other pronouns such as "it" are fine.

2. Use the imperative mood to describe what the code does. (e.g., Use "Return the value" instead of "Returns the value" or "This returns the value").

Output format:
If there are no violations, output exactly: "LGTM".
If there are violations, output the snippet from the input where the violation occurs. Prefix each line with a '>' character, followed by a brief description of the violated rule.

### Example 1 (Pronoun Violation) ###
Input Patch:
+                /*
+                 * Although free_pgd_range() is intended for freeing user
+                 * page-tables, it also works out for kernel mappings on x86.
+                 * We use tlb_gather_mmu_fullmm() to avoid confusing the
+                 * range-tracking logic in __tlb_adjust_range().
+                 */
+                tlb_gather_mmu_fullmm(&tlb, mm);

Output:
>+                 * We use tlb_gather_mmu_fullmm() to avoid confusing the
>+                 * range-tracking logic in __tlb_adjust_range().
Avoid personal pronouns ("We").

### Example 2 (Imperative Mood Violation) ###
Input Patch:
+ // Initializes the counter and prepares the struct.
+ counter = 0;

Output:
>+ // Initializes the counter and prepares the struct.
Use the imperative mood (e.g., "Initialize the counter...").

<END INSTRUCTIONS>
<BEGIN PATCH FOR REVIEW>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 07/22] mm: KUnit tests for the mermap
  2026-03-20 18:23 ` [PATCH v2 07/22] mm: KUnit tests for " Brendan Jackman
@ 2026-03-24  8:00   ` kernel test robot
  0 siblings, 0 replies; 32+ messages in thread
From: kernel test robot @ 2026-03-24  8:00 UTC (permalink / raw)
  To: Brendan Jackman, Borislav Petkov, Dave Hansen, Peter Zijlstra,
	Andrew Morton, David Hildenbrand, Vlastimil Babka, Wei Xu,
	Johannes Weiner, Zi Yan, Lorenzo Stoakes
  Cc: oe-kbuild-all, Linux Memory Management List, linux-kernel, x86,
	rppt, Sumit Garg, derkling, reijiw, Will Deacon, rientjes,
	Kalyazin, Nikita, patrick.roy, Itazuri, Takahiro, Andy Lutomirski,
	David Kaplan, Thomas Gleixner, Brendan Jackman, Yosry Ahmed

Hi Brendan,

kernel test robot noticed the following build warnings:

[auto build test WARNING on b5d083a3ed1e2798396d5e491432e887da8d4a06]

url:    https://github.com/intel-lab-lkp/linux/commits/Brendan-Jackman/x86-mm-split-out-preallocate_sub_pgd/20260321-042521
base:   b5d083a3ed1e2798396d5e491432e887da8d4a06
patch link:    https://lore.kernel.org/r/20260320-page_alloc-unmapped-v2-7-28bf1bd54f41%40google.com
patch subject: [PATCH v2 07/22] mm: KUnit tests for the mermap
config: x86_64-randconfig-101-20260324 (https://download.01.org/0day-ci/archive/20260324/202603241512.3kG43FzT-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603241512.3kG43FzT-lkp@intel.com/

cocci warnings: (new ones prefixed by >>)
>> mm/tests/mermap_kunit.c:156:2-3: Unneeded semicolon

vim +156 mm/tests/mermap_kunit.c

   131	
   132	static void test_multiple_allocs(struct kunit *test)
   133	{
   134		struct __mermap_put_args *argss[NR_NORMAL_ALLOCS] = { };
   135		struct page *pages[NR_NORMAL_ALLOCS + 1];
   136		struct mermap_alloc *reserved_alloc;
   137		struct mm_struct *mm = get_mm(test);
   138		int magic = 0xE4A4;
   139	
   140		for (int i = 0; i < ARRAY_SIZE(pages); i++) {
   141			pages[i] = alloc_page_wrapper(test, GFP_KERNEL);
   142			WRITE_ONCE(*(int *)page_to_virt(pages[i]), magic + i);
   143		}
   144	
   145		for (int i = 0; i < ARRAY_SIZE(argss); i++) {
   146			unsigned long base = mermap_cpu_base(raw_smp_processor_id());
   147			unsigned long end = mermap_cpu_end(raw_smp_processor_id());
   148			unsigned long addr;
   149	
   150			argss[i] = __mermap_get_wrapper(test, mm, pages[i], PAGE_SIZE, PAGE_KERNEL);
   151			KUNIT_ASSERT_NOT_NULL_MSG(test, argss[i], "alloc %d failed", i);
   152	
   153			addr = (unsigned long) mermap_addr(argss[i]->alloc);
   154			KUNIT_EXPECT_GE_MSG(test, addr, base, "alloc %d out of range", i);
   155			KUNIT_EXPECT_LT_MSG(test, addr, end, "alloc %d out of range", i);
 > 156		};
   157	
   158		/*
   159		 * Read through the mappings to try and detect if they point to the
   160		 * pages we wrote earlier.
   161		 */
   162		kthread_use_mm(mm);
   163		for (int i = 0; i < ARRAY_SIZE(pages) - 1; i++) {
   164			int *ptr  = (int *)mermap_addr(argss[i]->alloc);
   165	
   166			KUNIT_EXPECT_EQ(test, *ptr, magic + i);
   167		}
   168	
   169		/* Run out of alloc structures, only reserved allocs should succeed now. */
   170		KUNIT_ASSERT_NULL(test, __mermap_get(mm, pages[NR_NORMAL_ALLOCS],
   171						     PAGE_SIZE, PAGE_KERNEL, false));
   172		preempt_disable();
   173		reserved_alloc = __mermap_get(mm, pages[NR_NORMAL_ALLOCS],
   174					      PAGE_SIZE, PAGE_KERNEL, true);
   175		KUNIT_EXPECT_NOT_NULL(test, reserved_alloc);
   176		/* Also check if this mapping seems correct*/
   177		if (reserved_alloc) {
   178			int *ptr  = (int *)mermap_addr(reserved_alloc);
   179	
   180			KUNIT_EXPECT_EQ(test, *ptr, magic + NR_NORMAL_ALLOCS);
   181	
   182			mermap_put(reserved_alloc);
   183		}
   184		preempt_enable();
   185	
   186		kthread_unuse_mm(mm);
   187	}
   188	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 01/22] x86/mm: split out preallocate_sub_pgd()
  2026-03-20 18:23 ` [PATCH v2 01/22] x86/mm: split out preallocate_sub_pgd() Brendan Jackman
  2026-03-20 19:42   ` Dave Hansen
@ 2026-03-24 15:27   ` Borislav Petkov
  2026-03-25 13:28     ` Brendan Jackman
  1 sibling, 1 reply; 32+ messages in thread
From: Borislav Petkov @ 2026-03-24 15:27 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Dave Hansen, Peter Zijlstra, Andrew Morton, David Hildenbrand,
	Vlastimil Babka, Wei Xu, Johannes Weiner, Zi Yan, Lorenzo Stoakes,
	linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Yosry Ahmed

On Fri, Mar 20, 2026 at 06:23:25PM +0000, Brendan Jackman wrote:
> This code will be needed elsewhere in a following patch. Split out the
> trivial code move for easy review.
> 
> This changes the logging slightly: instead of panic() directly reporting
> the level of the failure, there is now a generic panic message which
> will be preceded by a separate warn that reports the level of the
> failure. This is a simple way to have this helper suit the needs of its
> new user as well as the existing one.

"Describe your changes in imperative mood, e.g. “make xyzzy do frotz” instead
of “[This patch] makes xyzzy do frotz” or “[I] changed xyzzy to do frotz”, as
if you are giving orders to the codebase to change its behaviour."

> Other than logging, no functional change intended.
> 
> Signed-off-by: Brendan Jackman <jackmanb@google.com>
> ---
>  arch/x86/include/asm/pgalloc.h | 33 +++++++++++++++++++++++++++++++
>  arch/x86/mm/init_64.c          | 44 +++++++-----------------------------------
>  2 files changed, 40 insertions(+), 37 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
> index c88691b15f3c6..3541b86c9c6b0 100644
> --- a/arch/x86/include/asm/pgalloc.h
> +++ b/arch/x86/include/asm/pgalloc.h

Why in a header?

That function is kinda bigger than the rest of the oneliners there...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 01/22] x86/mm: split out preallocate_sub_pgd()
  2026-03-24 15:27   ` Borislav Petkov
@ 2026-03-25 13:28     ` Brendan Jackman
  0 siblings, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-25 13:28 UTC (permalink / raw)
  To: Borislav Petkov, Brendan Jackman
  Cc: Dave Hansen, Peter Zijlstra, Andrew Morton, David Hildenbrand,
	Vlastimil Babka, Wei Xu, Johannes Weiner, Zi Yan, Lorenzo Stoakes,
	linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Yosry Ahmed

On Tue Mar 24, 2026 at 3:27 PM UTC, Borislav Petkov wrote:
> On Fri, Mar 20, 2026 at 06:23:25PM +0000, Brendan Jackman wrote:
>> This code will be needed elsewhere in a following patch. Split out the
>> trivial code move for easy review.
>> 
>> This changes the logging slightly: instead of panic() directly reporting
>> the level of the failure, there is now a generic panic message which
>> will be preceded by a separate warn that reports the level of the
>> failure. This is a simple way to have this helper suit the needs of its
>> new user as well as the existing one.
>
> "Describe your changes in imperative mood, e.g. “make xyzzy do frotz” instead
> of “[This patch] makes xyzzy do frotz” or “[I] changed xyzzy to do frotz”, as
> if you are giving orders to the codebase to change its behaviour."

Yeah there are a bunch of comments in this series where I've written "we
do XYZ" instead of "do XYZ", I've set up an AI prompt [0] to try and
catch it for me before I send you guys crazy with yet another series
that makes the same mistake over and over.

[0] https://lore.kernel.org/all/DHA6G1E2H5P4.2D7JKTRKIBE3U@google.com/

TBH for this particular case it seems borderline? The referent of the
"this" is the previous paragraph; it's trying to show that the logging
change is a side effect of the code movement rather than a part of
the patch's intent. But there are more explicit ways to carry that
message so I'll rewrite it.

>> Other than logging, no functional change intended.
>> 
>> Signed-off-by: Brendan Jackman <jackmanb@google.com>
>> ---
>>  arch/x86/include/asm/pgalloc.h | 33 +++++++++++++++++++++++++++++++
>>  arch/x86/mm/init_64.c          | 44 +++++++-----------------------------------
>>  2 files changed, 40 insertions(+), 37 deletions(-)
>> 
>> diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
>> index c88691b15f3c6..3541b86c9c6b0 100644
>> --- a/arch/x86/include/asm/pgalloc.h
>> +++ b/arch/x86/include/asm/pgalloc.h
>
> Why in a header?
>
> That function is kinda bigger than the rest of the oneliners there...

Hm, I suspect I wanted to avoid creating new ifdefs in the C file? Seems
harmless though. Looks fine in arch/x86/mm/pgtable.c.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 02/22] x86/mm: Generalize LDT remap into "mm-local region"
  2026-03-20 18:23 ` [PATCH v2 02/22] x86/mm: Generalize LDT remap into "mm-local region" Brendan Jackman
  2026-03-20 19:47   ` Dave Hansen
@ 2026-03-25 14:23   ` Brendan Jackman
  1 sibling, 0 replies; 32+ messages in thread
From: Brendan Jackman @ 2026-03-25 14:23 UTC (permalink / raw)
  To: Brendan Jackman, Borislav Petkov, Dave Hansen, Peter Zijlstra,
	Andrew Morton, David Hildenbrand, Vlastimil Babka, Wei Xu,
	Johannes Weiner, Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Yosry Ahmed

Summarizing Sashiko review [0] so all the comments are in the same place...

[0] https://sashiko.dev/#/patchset/20260320-page_alloc-unmapped-v2-0-28bf1bd54f41%40google.com

On Fri Mar 20, 2026 at 6:23 PM UTC, Brendan Jackman wrote:
> Various security features benefit from having process-local address
> mappings. Examples include no-direct-map guest_memfd [2] and significant
> optimizations for ASI [1].
>
> As pointed out by Andy in [0], x86 already has a PGD entry that is local
> to the mm, which is used for the LDT.
>
> So, simply redefine that entry's region as "the mm-local region" and
> then redefine the LDT region as a sub-region of that.
>
> With the currently-envisaged usecases, there will be many situations
> where almost no processes have any need for the mm-local region.
> Therefore, avoid its overhead (memory cost of pagetables, alloc/free
> overhead during fork/exit) for processes that don't use it by requiring
> its users to explicitly initialize it via the new mm_local_* API.
>
> This means that the LDT remap code can be simplified:
>
> 1. map_ldt_struct_to_user() and free_ldt_pgtables() are no longer
>    required as the mm_local core code handles that automatically.
>
> 2. The sanity-check logic is unified: in both cases just walk the
>    pagetables via a generic mechanism. This slightly relaxes the
>    sanity-checking since lookup_address_in_pgd() is more flexible than
>    pgd_to_pmd_walk(), but this seems to be worth it for the simplified
>    code.
>
> On 64-bit, the mm-local region gets a whole PGD. On 32-bit, it just
> gets one PMD, i.e. it is completely consumed by the LDT remap - no
> investigation has been done into whether it's feasible to expand the
> region on 32-bit. Most likely there is no strong usecase for that
> anyway.
>
> In both cases, in order to combine the need for on-demand mm
> initialisation with the desire to transparently handle
> propagating mappings to userspace under KPTI, the user and kernel
> pagetables are shared at the highest level possible. For PAE that means
> the PTE table is shared and for 64-bit the P4D/PUD. This is implemented
> by pre-allocating the first shared table when the mm-local region is
> first initialised.
>
> The PAE implementation of mm_local_map_to_user() does not allocate
> pagetables, it assumes the PMD has been preallocated. To make that
> assumption safer, expose PREALLOCATED_PMDs in the arch headers so that
> mm_local_map_to_user() can have a BUILD_BUG_ON().
>
> [0] https://lore.kernel.org/linux-mm/CALCETrXHbS9VXfZ80kOjiTrreM2EbapYeGp68mvJPbosUtorYA@mail.gmail.com/
> [1] https://linuxasi.dev/
> [2] https://lore.kernel.org/all/20250924151101.2225820-1-patrick.roy@campus.lmu.de
> Signed-off-by: Brendan Jackman <jackmanb@google.com>
> ---
>  Documentation/arch/x86/x86_64/mm.rst    |   4 +-
>  arch/x86/Kconfig                        |   2 +
>  arch/x86/include/asm/mmu_context.h      | 119 ++++++++++++++++++++++++++++-
>  arch/x86/include/asm/page.h             |  32 ++++++++
>  arch/x86/include/asm/pgtable_32_areas.h |   9 ++-
>  arch/x86/include/asm/pgtable_64_types.h |  12 ++-
>  arch/x86/kernel/ldt.c                   | 130 +++++---------------------------
>  arch/x86/mm/pgtable.c                   |  32 +-------
>  include/linux/mm.h                      |  13 ++++
>  include/linux/mm_types.h                |   2 +
>  kernel/fork.c                           |   1 +
>  mm/Kconfig                              |  11 +++
>  12 files changed, 217 insertions(+), 150 deletions(-)
>
> diff --git a/Documentation/arch/x86/x86_64/mm.rst b/Documentation/arch/x86/x86_64/mm.rst
> index a6cf05d51bd8c..fa2bb7bab6a42 100644
> --- a/Documentation/arch/x86/x86_64/mm.rst
> +++ b/Documentation/arch/x86/x86_64/mm.rst
> @@ -53,7 +53,7 @@ Complete virtual memory map with 4-level page tables
>    ____________________________________________________________|___________________________________________________________
>                      |            |                  |         |
>     ffff800000000000 | -128    TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
> -   ffff880000000000 | -120    TB | ffff887fffffffff |  0.5 TB | LDT remap for PTI
> +   ffff880000000000 | -120    TB | ffff887fffffffff |  0.5 TB | MM-local kernel data. Includes LDT remap for PTI
>     ffff888000000000 | -119.5  TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
>     ffffc88000000000 |  -55.5  TB | ffffc8ffffffffff |  0.5 TB | ... unused hole
>     ffffc90000000000 |  -55    TB | ffffe8ffffffffff |   32 TB | vmalloc/ioremap space (vmalloc_base)
> @@ -123,7 +123,7 @@ Complete virtual memory map with 5-level page tables
>    ____________________________________________________________|___________________________________________________________
>                      |            |                  |         |
>     ff00000000000000 |  -64    PB | ff0fffffffffffff |    4 PB | ... guard hole, also reserved for hypervisor
> -   ff10000000000000 |  -60    PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI
> +   ff10000000000000 |  -60    PB | ff10ffffffffffff | 0.25 PB | MM-local kernel data. Includes LDT remap for PTI
>     ff11000000000000 |  -59.75 PB | ff90ffffffffffff |   32 PB | direct mapping of all physical memory (page_offset_base)
>     ff91000000000000 |  -27.75 PB | ff9fffffffffffff | 3.75 PB | ... unused hole
>     ffa0000000000000 |  -24    PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base)
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 8038b26ae99e0..d7073b6077c62 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -133,6 +133,7 @@ config X86
>  	select ARCH_SUPPORTS_RT
>  	select ARCH_SUPPORTS_AUTOFDO_CLANG
>  	select ARCH_SUPPORTS_PROPELLER_CLANG    if X86_64
> +	select ARCH_SUPPORTS_MM_LOCAL_REGION	if X86_64 || X86_PAE
>  	select ARCH_USE_BUILTIN_BSWAP
>  	select ARCH_USE_CMPXCHG_LOCKREF		if X86_CX8
>  	select ARCH_USE_MEMTEST
> @@ -2323,6 +2324,7 @@ config CMDLINE_OVERRIDE
>  config MODIFY_LDT_SYSCALL
>  	bool "Enable the LDT (local descriptor table)" if EXPERT
>  	default y
> +	select MM_LOCAL_REGION if MITIGATION_PAGE_TABLE_ISOLATION || X86_PAE
>  	help
>  	  Linux can allow user programs to install a per-process x86
>  	  Local Descriptor Table (LDT) using the modify_ldt(2) system
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index ef5b507de34e2..14f75d1d7e28f 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -8,8 +8,10 @@
>  
>  #include <trace/events/tlb.h>
>  
> +#include <asm/tlb.h>
>  #include <asm/tlbflush.h>
>  #include <asm/paravirt.h>
> +#include <asm/pgalloc.h>
>  #include <asm/debugreg.h>
>  #include <asm/gsseg.h>
>  #include <asm/desc.h>
> @@ -59,7 +61,6 @@ static inline void init_new_context_ldt(struct mm_struct *mm)
>  }
>  int ldt_dup_context(struct mm_struct *oldmm, struct mm_struct *mm);
>  void destroy_context_ldt(struct mm_struct *mm);
> -void ldt_arch_exit_mmap(struct mm_struct *mm);
>  #else	/* CONFIG_MODIFY_LDT_SYSCALL */
>  static inline void init_new_context_ldt(struct mm_struct *mm) { }
>  static inline int ldt_dup_context(struct mm_struct *oldmm,
> @@ -68,7 +69,6 @@ static inline int ldt_dup_context(struct mm_struct *oldmm,
>  	return 0;
>  }
>  static inline void destroy_context_ldt(struct mm_struct *mm) { }
> -static inline void ldt_arch_exit_mmap(struct mm_struct *mm) { }
>  #endif
>  
>  #ifdef CONFIG_MODIFY_LDT_SYSCALL
> @@ -223,10 +223,123 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
>  	return ldt_dup_context(oldmm, mm);
>  }
>  
> +#ifdef CONFIG_MM_LOCAL_REGION
> +static inline void mm_local_region_free(struct mm_struct *mm)
> +{
> +	if (mm_local_region_used(mm)) {
> +		struct mmu_gather tlb;
> +		unsigned long start = MM_LOCAL_BASE_ADDR;
> +		unsigned long end = MM_LOCAL_END_ADDR;
> +
> +		/*
> +		 * Although free_pgd_range() is intended for freeing user
> +		 * page-tables, it also works out for kernel mappings on x86.
> +		 * We use tlb_gather_mmu_fullmm() to avoid confusing the
> +		 * range-tracking logic in __tlb_adjust_range().
> +		 */
> +		tlb_gather_mmu_fullmm(&tlb, mm);
> +		free_pgd_range(&tlb, start, end, start, end);
> +		tlb_finish_mmu(&tlb);
> +
> +		mm_flags_clear(MMF_LOCAL_REGION_USED, mm);
> +	}
> +}
> +
> +#if defined(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION) && defined(CONFIG_X86_PAE)
> +static inline pmd_t *pgd_to_pmd_walk(pgd_t *pgd, unsigned long va)
> +{
> +	p4d_t *p4d;
> +	pud_t *pud;
> +
> +	if (pgd->pgd == 0)
> +		return NULL;
> +
> +	p4d = p4d_offset(pgd, va);
> +	if (p4d_none(*p4d))
> +		return NULL;
> +
> +	pud = pud_offset(p4d, va);
> +	if (pud_none(*pud))
> +		return NULL;
> +
> +	return pmd_offset(pud, va);
> +}
> +
> +static inline int mm_local_map_to_user(struct mm_struct *mm)
> +{
> +	BUILD_BUG_ON(!PREALLOCATED_PMDS);
> +	pgd_t *k_pgd = pgd_offset(mm, MM_LOCAL_BASE_ADDR);
> +	pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
> +	pmd_t *k_pmd, *u_pmd;
> +	int err;
> +
> +	k_pmd = pgd_to_pmd_walk(k_pgd, MM_LOCAL_BASE_ADDR);
> +	u_pmd = pgd_to_pmd_walk(u_pgd, MM_LOCAL_BASE_ADDR);
> +
> +	BUILD_BUG_ON(MM_LOCAL_END_ADDR - MM_LOCAL_BASE_ADDR > PMD_SIZE);
> +
> +	/* Preallocate the PTE table so it can be shared. */
> +	err = pte_alloc(mm, k_pmd);
> +	if (err)
> +		return err;
> +
> +	/* Point the userspace PMD at the same PTE as the kernel PMD. */
> +	set_pmd(u_pmd, *k_pmd);
> +	return 0;
> +}
> +#elif defined(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION)
> +static inline int mm_local_map_to_user(struct mm_struct *mm)
> +{
> +	pgd_t *pgd;
> +	int err;
> +
> +	err = preallocate_sub_pgd(mm, MM_LOCAL_BASE_ADDR);
> +	if (err)
> +		return err;
> +
> +	pgd = pgd_offset(mm, MM_LOCAL_BASE_ADDR);
> +	set_pgd(kernel_to_user_pgdp(pgd), *pgd);
> +	return 0;
> +}
> +#else
> +static inline int mm_local_map_to_user(struct mm_struct *mm)
> +{
> +	WARN_ONCE(1, "mm_local_map_to_user() not implemented");
> +	return -EINVAL;
> +}
> +#endif
> +
> +/*
> + * Do initial setup of the user-local region. Call from process context.
> + *
> + * Under PTI, userspace shares the pagetables for the mm-local region with the
> + * kernel (if you map stuff here, it's immediately mapped into userspace too),
> + * just like the LDT remap. It's assumed that nothing gets mapped in here that
> + * needs to be protected from Meltdown-type attacks from the current process.
> + */
> +static inline int mm_local_region_init(struct mm_struct *mm)
> +{
> +	int err;
> +
> +	if (boot_cpu_has(X86_FEATURE_PTI)) {
> +		err = mm_local_map_to_user(mm);
> +		if (err)
> +			return err;
> +	}
> +
> +	mm_flags_set(MMF_LOCAL_REGION_USED, mm);
> +
> +	return 0;
> +}
> +
> +#else
> +static inline void mm_local_region_free(struct mm_struct *mm) { }
> +#endif /* CONFIG_MM_LOCAL_REGION */
> +
>  static inline void arch_exit_mmap(struct mm_struct *mm)
>  {
>  	paravirt_arch_exit_mmap(mm);
> -	ldt_arch_exit_mmap(mm);
> +	mm_local_region_free(mm);
>  }
>  
>  #ifdef CONFIG_X86_64
> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> index 416dc88e35c15..4de4715c3b40f 100644
> --- a/arch/x86/include/asm/page.h
> +++ b/arch/x86/include/asm/page.h
> @@ -78,6 +78,38 @@ static __always_inline u64 __is_canonical_address(u64 vaddr, u8 vaddr_bits)
>  	return __canonical_address(vaddr, vaddr_bits) == vaddr;
>  }
>  
> +#ifdef CONFIG_X86_PAE
> +
> +/*
> + * In PAE mode, we need to do a cr3 reload (=tlb flush) when
> + * updating the top-level pagetable entries to guarantee the
> + * processor notices the update.  Since this is expensive, and
> + * all 4 top-level entries are used almost immediately in a
> + * new process's life, we just pre-populate them here.
> + */
> +#define PREALLOCATED_PMDS	PTRS_PER_PGD
> +/*
> + * "USER_PMDS" are the PMDs for the user copy of the page tables when
> + * PTI is enabled. They do not exist when PTI is disabled.  Note that
> + * this is distinct from the user _portion_ of the kernel page tables
> + * which always exists.
> + *
> + * We allocate separate PMDs for the kernel part of the user page-table
> + * when PTI is enabled. We need them to map the per-process LDT into the
> + * user-space page-table.
> + */
> +#define PREALLOCATED_USER_PMDS (boot_cpu_has(X86_FEATURE_PTI) ? KERNEL_PGD_PTRS : 0)
> +#define MAX_PREALLOCATED_USER_PMDS KERNEL_PGD_PTRS
> +
> +#else  /* !CONFIG_X86_PAE */
> +
> +/* No need to prepopulate any pagetable entries in non-PAE modes. */
> +#define PREALLOCATED_PMDS	0
> +#define PREALLOCATED_USER_PMDS	0
> +#define MAX_PREALLOCATED_USER_PMDS 0
> +
> +#endif	/* CONFIG_X86_PAE */
> +
>  #endif	/* __ASSEMBLER__ */
>  
>  #include <asm-generic/memory_model.h>
> diff --git a/arch/x86/include/asm/pgtable_32_areas.h b/arch/x86/include/asm/pgtable_32_areas.h
> index 921148b429676..7fccb887f8b33 100644
> --- a/arch/x86/include/asm/pgtable_32_areas.h
> +++ b/arch/x86/include/asm/pgtable_32_areas.h
> @@ -30,9 +30,14 @@ extern bool __vmalloc_start_set; /* set once high_memory is set */
>  #define CPU_ENTRY_AREA_BASE	\
>  	((FIXADDR_TOT_START - PAGE_SIZE*(CPU_ENTRY_AREA_PAGES+1)) & PMD_MASK)
>  
> -#define LDT_BASE_ADDR		\
> -	((CPU_ENTRY_AREA_BASE - PAGE_SIZE) & PMD_MASK)
> +/*
> + * On 32-bit the mm-local region is currently completely consumed by the LDT
> + * remap.
> + */
> +#define MM_LOCAL_BASE_ADDR	((CPU_ENTRY_AREA_BASE - PAGE_SIZE) & PMD_MASK)
> +#define MM_LOCAL_END_ADDR	(MM_LOCAL_BASE_ADDR + PMD_SIZE)
>  
> +#define LDT_BASE_ADDR		MM_LOCAL_BASE_ADDR
>  #define LDT_END_ADDR		(LDT_BASE_ADDR + PMD_SIZE)
>  
>  #define PKMAP_BASE		\
> diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
> index 7eb61ef6a185f..1181565966405 100644
> --- a/arch/x86/include/asm/pgtable_64_types.h
> +++ b/arch/x86/include/asm/pgtable_64_types.h
> @@ -5,8 +5,11 @@
>  #include <asm/sparsemem.h>
>  
>  #ifndef __ASSEMBLER__
> +#include <linux/build_bug.h>
>  #include <linux/types.h>
>  #include <asm/kaslr.h>
> +#include <asm/page_types.h>
> +#include <uapi/asm/ldt.h>
>  
>  /*
>   * These are used to make use of C type-checking..
> @@ -100,9 +103,12 @@ extern unsigned int ptrs_per_p4d;
>  #define GUARD_HOLE_BASE_ADDR	(GUARD_HOLE_PGD_ENTRY << PGDIR_SHIFT)
>  #define GUARD_HOLE_END_ADDR	(GUARD_HOLE_BASE_ADDR + GUARD_HOLE_SIZE)
>  
> -#define LDT_PGD_ENTRY		-240UL
> -#define LDT_BASE_ADDR		(LDT_PGD_ENTRY << PGDIR_SHIFT)
> -#define LDT_END_ADDR		(LDT_BASE_ADDR + PGDIR_SIZE)
> +#define MM_LOCAL_PGD_ENTRY	-240UL
> +#define MM_LOCAL_BASE_ADDR	(MM_LOCAL_PGD_ENTRY << PGDIR_SHIFT)
> +#define MM_LOCAL_END_ADDR	((MM_LOCAL_PGD_ENTRY + 1) << PGDIR_SHIFT)
> +
> +#define LDT_BASE_ADDR		MM_LOCAL_BASE_ADDR
> +#define LDT_END_ADDR		(LDT_BASE_ADDR + PMD_SIZE)
>  
>  #define __VMALLOC_BASE_L4	0xffffc90000000000UL
>  #define __VMALLOC_BASE_L5 	0xffa0000000000000UL
> diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
> index 40c5bf97dd5cc..fb2a1914539f8 100644
> --- a/arch/x86/kernel/ldt.c
> +++ b/arch/x86/kernel/ldt.c
> @@ -31,6 +31,8 @@
>  
>  #include <xen/xen.h>
>  
> +/* LDTs are double-buffered, the buffers are called slots. */
> +#define LDT_NUM_SLOTS		2
>  /* This is a multiple of PAGE_SIZE. */
>  #define LDT_SLOT_STRIDE (LDT_ENTRIES * LDT_ENTRY_SIZE)
>  
> @@ -186,100 +188,36 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
>  
>  #ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
>  
> -static void do_sanity_check(struct mm_struct *mm,
> -			    bool had_kernel_mapping,
> -			    bool had_user_mapping)
> +static void sanity_check_ldt_mapping(struct mm_struct *mm)
>  {
> +	pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
> +	pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
> +	unsigned int k_level, u_level;
> +	bool had_kernel, had_user;
> +
> +	had_kernel = lookup_address_in_pgd(k_pgd, LDT_BASE_ADDR, &k_level);
> +	had_user   = lookup_address_in_pgd(u_pgd, LDT_BASE_ADDR, &u_level);
> +
>  	if (mm->context.ldt) {
>  		/*
>  		 * We already had an LDT.  The top-level entry should already
>  		 * have been allocated and synchronized with the usermode
>  		 * tables.
>  		 */
> -		WARN_ON(!had_kernel_mapping);
> +		WARN_ON(!had_kernel);
>  		if (boot_cpu_has(X86_FEATURE_PTI))
> -			WARN_ON(!had_user_mapping);
> +			WARN_ON(!had_user);
>  	} else {
>  		/*
>  		 * This is the first time we're mapping an LDT for this process.
>  		 * Sync the pgd to the usermode tables.
>  		 */
> -		WARN_ON(had_kernel_mapping);
> +		WARN_ON(had_kernel);
>  		if (boot_cpu_has(X86_FEATURE_PTI))
> -			WARN_ON(had_user_mapping);
> +			WARN_ON(had_user);

But under PAE the PTE is preallocated. lookup_address_in_pgd() returns
NULL if the address is unmapped at a higher level, but at the 4K level
specifically it returns a non-NULL pointer to a non-present PTE, so
had_kernel/had_user come out true even when nothing is mapped.

This WARNs immediately when I run the selftests, so I suspect I broke
this and then forgot to retest with PTI.

>  	}
>  }
>  
> -#ifdef CONFIG_X86_PAE
> -
> -static pmd_t *pgd_to_pmd_walk(pgd_t *pgd, unsigned long va)
> -{
> -	p4d_t *p4d;
> -	pud_t *pud;
> -
> -	if (pgd->pgd == 0)
> -		return NULL;
> -
> -	p4d = p4d_offset(pgd, va);
> -	if (p4d_none(*p4d))
> -		return NULL;
> -
> -	pud = pud_offset(p4d, va);
> -	if (pud_none(*pud))
> -		return NULL;
> -
> -	return pmd_offset(pud, va);
> -}
> -
> -static void map_ldt_struct_to_user(struct mm_struct *mm)
> -{
> -	pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
> -	pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
> -	pmd_t *k_pmd, *u_pmd;
> -
> -	k_pmd = pgd_to_pmd_walk(k_pgd, LDT_BASE_ADDR);
> -	u_pmd = pgd_to_pmd_walk(u_pgd, LDT_BASE_ADDR);
> -
> -	if (boot_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
> -		set_pmd(u_pmd, *k_pmd);
> -}
> -
> -static void sanity_check_ldt_mapping(struct mm_struct *mm)
> -{
> -	pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
> -	pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
> -	bool had_kernel, had_user;
> -	pmd_t *k_pmd, *u_pmd;
> -
> -	k_pmd      = pgd_to_pmd_walk(k_pgd, LDT_BASE_ADDR);
> -	u_pmd      = pgd_to_pmd_walk(u_pgd, LDT_BASE_ADDR);
> -	had_kernel = (k_pmd->pmd != 0);
> -	had_user   = (u_pmd->pmd != 0);
> -
> -	do_sanity_check(mm, had_kernel, had_user);
> -}
> -
> -#else /* !CONFIG_X86_PAE */
> -
> -static void map_ldt_struct_to_user(struct mm_struct *mm)
> -{
> -	pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
> -
> -	if (boot_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
> -		set_pgd(kernel_to_user_pgdp(pgd), *pgd);
> -}
> -
> -static void sanity_check_ldt_mapping(struct mm_struct *mm)
> -{
> -	pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
> -	bool had_kernel = (pgd->pgd != 0);
> -	bool had_user   = (kernel_to_user_pgdp(pgd)->pgd != 0);
> -
> -	do_sanity_check(mm, had_kernel, had_user);
> -}
> -
> -#endif /* CONFIG_X86_PAE */
> -
>  /*
>   * If PTI is enabled, this maps the LDT into the kernelmode and
>   * usermode tables for the given mm.
> @@ -295,6 +233,8 @@ map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
>  	if (!boot_cpu_has(X86_FEATURE_PTI))
>  		return 0;
>  
> +	mm_local_region_init(mm);

Need to handle errors...

It also seems to think there's a path where we allocate a pagetable in
mm_local_region_init(), then fail without setting MMF_LOCAL_REGION_USED,
and don't free the pagetable. I can't see the path it's talking about
though.

> +
>  	/*
>  	 * Any given ldt_struct should have map_ldt_struct() called at most
>  	 * once.
> @@ -339,9 +279,6 @@ map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
>  		pte_unmap_unlock(ptep, ptl);
>  	}
>  
> -	/* Propagate LDT mapping to the user page-table */
> -	map_ldt_struct_to_user(mm);
> -
>  	ldt->slot = slot;
>  	return 0;
>  }
> @@ -390,28 +327,6 @@ static void unmap_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt)
>  }
>  #endif /* CONFIG_MITIGATION_PAGE_TABLE_ISOLATION */
>  
> -static void free_ldt_pgtables(struct mm_struct *mm)
> -{
> -#ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
> -	struct mmu_gather tlb;
> -	unsigned long start = LDT_BASE_ADDR;
> -	unsigned long end = LDT_END_ADDR;
> -
> -	if (!boot_cpu_has(X86_FEATURE_PTI))
> -		return;
> -
> -	/*
> -	 * Although free_pgd_range() is intended for freeing user
> -	 * page-tables, it also works out for kernel mappings on x86.
> -	 * We use tlb_gather_mmu_fullmm() to avoid confusing the
> -	 * range-tracking logic in __tlb_adjust_range().
> -	 */
> -	tlb_gather_mmu_fullmm(&tlb, mm);
> -	free_pgd_range(&tlb, start, end, start, end);
> -	tlb_finish_mmu(&tlb);
> -#endif
> -}
> -
>  /* After calling this, the LDT is immutable. */
>  static void finalize_ldt_struct(struct ldt_struct *ldt)
>  {
> @@ -472,7 +387,6 @@ int ldt_dup_context(struct mm_struct *old_mm, struct mm_struct *mm)
>  
>  	retval = map_ldt_struct(mm, new_ldt, 0);
>  	if (retval) {
> -		free_ldt_pgtables(mm);
>  		free_ldt_struct(new_ldt);
>  		goto out_unlock;
>  	}
> @@ -494,11 +408,6 @@ void destroy_context_ldt(struct mm_struct *mm)
>  	mm->context.ldt = NULL;
>  }
>  
> -void ldt_arch_exit_mmap(struct mm_struct *mm)
> -{
> -	free_ldt_pgtables(mm);
> -}
> -
>  static int read_ldt(void __user *ptr, unsigned long bytecount)
>  {
>  	struct mm_struct *mm = current->mm;
> @@ -645,10 +554,9 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
>  		/*
>  		 * This only can fail for the first LDT setup. If an LDT is
>  		 * already installed then the PTE page is already
> -		 * populated. Mop up a half populated page table.
> +		 * populated.
>  		 */
> -		if (!WARN_ON_ONCE(old_ldt))
> -			free_ldt_pgtables(mm);
> +		WARN_ON_ONCE(!old_ldt);

That should be WARN_ON_ONCE(old_ldt); per the comment above, failure is
only expected on the first LDT setup, i.e. when there is no old LDT.

>  		free_ldt_struct(new_ldt);
>  		goto out_unlock;
>  	}
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 2e5ecfdce73c3..e4132696c9ef2 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -111,29 +111,6 @@ static void pgd_dtor(pgd_t *pgd)
>   */
>  
>  #ifdef CONFIG_X86_PAE
> -/*
> - * In PAE mode, we need to do a cr3 reload (=tlb flush) when
> - * updating the top-level pagetable entries to guarantee the
> - * processor notices the update.  Since this is expensive, and
> - * all 4 top-level entries are used almost immediately in a
> - * new process's life, we just pre-populate them here.
> - */
> -#define PREALLOCATED_PMDS	PTRS_PER_PGD
> -
> -/*
> - * "USER_PMDS" are the PMDs for the user copy of the page tables when
> - * PTI is enabled. They do not exist when PTI is disabled.  Note that
> - * this is distinct from the user _portion_ of the kernel page tables
> - * which always exists.
> - *
> - * We allocate separate PMDs for the kernel part of the user page-table
> - * when PTI is enabled. We need them to map the per-process LDT into the
> - * user-space page-table.
> - */
> -#define PREALLOCATED_USER_PMDS	 (boot_cpu_has(X86_FEATURE_PTI) ? \
> -					KERNEL_PGD_PTRS : 0)
> -#define MAX_PREALLOCATED_USER_PMDS KERNEL_PGD_PTRS
> -
>  void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmd)
>  {
>  	paravirt_alloc_pmd(mm, __pa(pmd) >> PAGE_SHIFT);
> @@ -150,12 +127,6 @@ void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmd)
>  	 */
>  	flush_tlb_mm(mm);
>  }
> -#else  /* !CONFIG_X86_PAE */
> -
> -/* No need to prepopulate any pagetable entries in non-PAE modes. */
> -#define PREALLOCATED_PMDS	0
> -#define PREALLOCATED_USER_PMDS	 0
> -#define MAX_PREALLOCATED_USER_PMDS 0
>  #endif	/* CONFIG_X86_PAE */
>  
>  static void free_pmds(struct mm_struct *mm, pmd_t *pmds[], int count)
> @@ -375,6 +346,9 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
>  
>  void pgd_free(struct mm_struct *mm, pgd_t *pgd)
>  {
> +	/* Should be cleaned up in mmap exit path. */
> +	VM_WARN_ON_ONCE(mm_local_region_used(mm));
> +
>  	pgd_mop_up_pmds(mm, pgd);
>  	pgd_dtor(pgd);
>  	paravirt_pgd_free(mm, pgd);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 70747b53c7da9..413dc707cff9b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -906,6 +906,19 @@ static inline void mm_flags_clear_all(struct mm_struct *mm)
>  	bitmap_zero(ACCESS_PRIVATE(&mm->flags, __mm_flags), NUM_MM_FLAG_BITS);
>  }
>  
> +#ifdef CONFIG_MM_LOCAL_REGION
> +static inline bool mm_local_region_used(struct mm_struct *mm)
> +{
> +	return mm_flags_test(MMF_LOCAL_REGION_USED, mm);
> +}
> +#else
> +static inline bool mm_local_region_used(struct mm_struct *mm)
> +{
> +	VM_WARN_ON_ONCE(mm_flags_test(MMF_LOCAL_REGION_USED, mm));
> +	return false;
> +}
> +#endif
> +
>  extern const struct vm_operations_struct vma_dummy_vm_ops;
>  
>  static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index cee934c6e78ec..0ca7cb7da918f 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1944,6 +1944,8 @@ enum {
>  
>  #define MMF_USER_HWCAP		32	/* user-defined HWCAPs */
>  
> +#define MMF_LOCAL_REGION_USED	33
> +
>  #define MMF_INIT_LEGACY_MASK	(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
>  				 MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
>  				 MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 68cf0109dde3c..ff075c74333fe 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1153,6 +1153,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>  fail_nocontext:
>  	mm_free_id(mm);
>  fail_noid:
> +	WARN_ON_ONCE(mm_local_region_used(mm));
>  	mm_free_pgd(mm);
>  fail_nopgd:
>  	futex_hash_free(mm);
> diff --git a/mm/Kconfig b/mm/Kconfig
> index ebd8ea353687e..2813059df9c1c 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1319,6 +1319,10 @@ config SECRETMEM
>  	default y
>  	bool "Enable memfd_secret() system call" if EXPERT
>  	depends on ARCH_HAS_SET_DIRECT_MAP
> +	# Soft dependency, for optimisation.
> +	imply MM_LOCAL_REGION
> +	imply MERMAP
> +	imply PAGE_ALLOC_UNMAPPED
>  	help
>  	  Enable the memfd_secret() system call with the ability to create
>  	  memory areas visible only in the context of the owning process and
> @@ -1471,6 +1475,13 @@ config LAZY_MMU_MODE_KUNIT_TEST
>  
>  	  If unsure, say N.
>  
> +config ARCH_SUPPORTS_MM_LOCAL_REGION
> +	def_bool n
> +
> +config MM_LOCAL_REGION
> +	bool
> +	depends on ARCH_SUPPORTS_MM_LOCAL_REGION
> +
>  source "mm/damon/Kconfig"
>  
>  endmenu



Thread overview: 32+ messages
2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 01/22] x86/mm: split out preallocate_sub_pgd() Brendan Jackman
2026-03-20 19:42   ` Dave Hansen
2026-03-23 11:01     ` Brendan Jackman
2026-03-24 15:27   ` Borislav Petkov
2026-03-25 13:28     ` Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 02/22] x86/mm: Generalize LDT remap into "mm-local region" Brendan Jackman
2026-03-20 19:47   ` Dave Hansen
2026-03-23 12:01     ` Brendan Jackman
2026-03-23 12:57       ` Brendan Jackman
2026-03-25 14:23   ` Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 03/22] x86/tlb: Expose some flush function declarations to modules Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 04/22] mm: Create flags arg for __apply_to_page_range() Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 05/22] mm: Add more flags " Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 06/22] x86/mm: introduce the mermap Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 07/22] mm: KUnit tests for " Brendan Jackman
2026-03-24  8:00   ` kernel test robot
2026-03-20 18:23 ` [PATCH v2 08/22] mm: introduce for_each_free_list() Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 09/22] mm/page_alloc: don't overload migratetype in find_suitable_fallback() Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 10/22] mm: introduce freetype_t Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 11/22] mm: move migratetype definitions to freetype.h Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 12/22] mm: add definitions for allocating unmapped pages Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 13/22] mm: rejig pageblock mask definitions Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 14/22] mm: encode freetype flags in pageblock flags Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 15/22] mm/page_alloc: remove ifdefs from pindex helpers Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 16/22] mm/page_alloc: separate pcplists by freetype flags Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 17/22] mm/page_alloc: rename ALLOC_NON_BLOCK back to _HARDER Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 18/22] mm/page_alloc: introduce ALLOC_NOBLOCK Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 19/22] mm/page_alloc: implement __GFP_UNMAPPED allocations Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 20/22] mm/page_alloc: implement __GFP_UNMAPPED|__GFP_ZERO allocations Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 21/22] mm: Minimal KUnit tests for some new page_alloc logic Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 22/22] mm/secretmem: Use __GFP_UNMAPPED when available Brendan Jackman
