public inbox for linux-mm@kvack.org
* [PATCH v2 00/22] mm: Add __GFP_UNMAPPED
@ 2026-03-20 18:23 Brendan Jackman
  2026-03-20 18:23 ` [PATCH v2 01/22] x86/mm: split out preallocate_sub_pgd() Brendan Jackman
                   ` (21 more replies)
  0 siblings, 22 replies; 33+ messages in thread
From: Brendan Jackman @ 2026-03-20 18:23 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Vlastimil Babka, Wei Xu, Johannes Weiner,
	Zi Yan, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, x86, rppt, Sumit Garg, derkling, reijiw,
	Will Deacon, rientjes, Kalyazin, Nikita, patrick.roy,
	Itazuri, Takahiro, Andy Lutomirski, David Kaplan, Thomas Gleixner,
	Brendan Jackman, Yosry Ahmed

.:: What? Why?

This series adds support for efficiently allocating pages that are not
present in the direct map. This is instrumental to two different
immediate goals:

1. This supports the effort to remove guest_memfd memory from the direct
   map [0]. One of the challenges faced in that effort has been
   efficiently eliminating TLB entries; this series offers a solution to
   that problem.

2. Address Space Isolation (ASI) [1] also needs an efficient way to
   allocate pages that are missing from the direct map. Although for ASI
   the needs are slightly different (in that case, the pages need only
   be removed from ASI's special pagetables), the most interesting mm
   challenges are basically the same.

   So, __GFP_UNMAPPED serves as a Trojan horse to get the page allocator
   into a state where adding ASI's features "Should Be Easy".

   This series _also_ serves as a Trojan horse for the "mermap" (details
   below) which is also a key building block for making ASI efficient.

Longer term, there is a wide range of security techniques unlocked by
being able to efficiently remove pages from the direct map. These range
from straightforward nice-to-haves like removing unused aliases of the
vmalloc area, to more ambitious ideas like totally unmapping _all_ user
memory.

There may also be non-security usecases for this feature, for example
at LPC Sumit Garg presented an issue with memory-firewalled client
devices that could be remediated by __GFP_UNMAPPED [2]. It seems also
likely to be a key step towards dropping execmem as a distinct
allocation layer. 

.:: Design

The key design elements introduced here are just repurposed from
previous attempts to directly introduce ASI's needs to the page
allocator [3]. The only real difference is that now these support
totally unmapping stuff from the direct map, instead of only unmapping
it from ASI's special pagetables.

Note that it's important that allocating unmapped memory be fully
scalable, in principle supporting basically everything the mm can do -
previous __GFP_UNMAPPED proposals have implemented it as a cache on top
of the page allocator. That won't do here because the end goal here is
protecting user data. That means that ultimately, the vast majority of
memory in the system ought to be allocated through a mechanism in the
vein of __GFP_UNMAPPED, so a really full-featured allocator is required.
(In this series, __GFP_UNMAPPED is only supported for unmovable
allocations, but that's just to save space in datastructures: the
implementation is supposed to be fully generic, eventually allowing
reclaim and compaction for the unmapped pages).

Because of the above, I'm proposing a new GFP flag. However the only
part of that I'm really attached to is that it's an interface to the
real page allocator. For now I still assume a GFP flag is the cleanest
way to get that but in principle I'm not opposed to
alloc_unmapped_pages() or whatever. Alternatively, there's some
discussion in [9] about how to get more GFP space back, which I'm happy
to help with.

.:::: Design: Introducing "freetypes"

The biggest challenge for efficiently getting stuff out of the direct
map is TLB flushing. Pushing this problem into the page allocator turns
out to enable amortising that flush cost into almost nothing. The core
idea is to have pools of already-unmapped pages. We'd like those pages
to be physically contiguous so they don't unduly fragment the pagetables
around them, and we'd like to be able to efficiently look up these
already-unmapped pages during allocation. The page allocator already has
deeply-ingrained functionality for physically grouping pages by a
certain attribute, and then indexing free pages by that attribute. This
mechanism is migratetypes.

So basically, this series extends the concepts of migratetypes in the
allocator so that as well as just representing mobility, they can
represent other properties of the page too. (Actually, migratetypes are
already sort of overloaded, but the main extension is to be able to
represent _orthogonal_ properties). In order to avoid further
overloading the concept of a migratetype, this extension is done by
adding a new concept on top of migratetype: the _freetype_. A freetype
is basically just a migratetype plus some flags, and it replaces
migratetypes wherever the latter is currently used to index free
pages.
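
As a rough userspace illustration of the "migratetype plus some flags"
idea (not the code from the series): every name below is hypothetical
and prefixed with sketch_, and the real freetype_t layout may well
differ. Wrapping the fields in a struct is one way to get the compiler
to reject a bare migratetype where a freetype is expected.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of migratetypes, for illustration only. */
enum sketch_migratetype {
	SKETCH_MIGRATE_UNMOVABLE,
	SKETCH_MIGRATE_MOVABLE,
	SKETCH_MIGRATE_RECLAIMABLE,
	SKETCH_NR_MIGRATETYPES,
};

/* Hypothetical flag: pages indexed by this freetype are not in the direct map. */
#define SKETCH_FREETYPE_UNMAPPED (1u << 0)
/* Number of distinct flag combinations (one flag, so two). */
#define SKETCH_NR_FREETYPE_FLAG_COMBOS 2u

/*
 * Assumption: a struct wrapper. A plain int cannot be passed where this
 * type is expected, so mixed-up call sites fail to compile.
 */
typedef struct {
	enum sketch_migratetype migratetype;
	unsigned int flags;
} sketch_freetype_t;

static inline sketch_freetype_t sketch_freetype(enum sketch_migratetype mt,
						unsigned int flags)
{
	sketch_freetype_t ft = { .migratetype = mt, .flags = flags };

	return ft;
}

static inline bool sketch_freetypes_equal(sketch_freetype_t a,
					  sketch_freetype_t b)
{
	return a.migratetype == b.migratetype && a.flags == b.flags;
}

/* Flatten a freetype into an index into an array of free lists. */
static inline unsigned int sketch_freetype_index(sketch_freetype_t ft)
{
	assert(ft.migratetype < SKETCH_NR_MIGRATETYPES);
	assert(ft.flags < SKETCH_NR_FREETYPE_FLAG_COMBOS);
	return ft.flags * SKETCH_NR_MIGRATETYPES + (unsigned int)ft.migratetype;
}
```

In this sketch the flag bits select a separate bank of free lists, so
mapped and unmapped pages of the same mobility never share a list.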

The first freetype flag is then added, which marks the pages it indexes
as being absent from the direct map. This is then used to implement the
new __GFP_UNMAPPED flag, which allocates pages from pageblocks that have
the new flag, or unmaps pages if no existing ones are already available.
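
To make the amortisation argument concrete, here is a toy userspace
model (a sketch under stated assumptions, not the allocator code from
the series). It assumes a pageblock of 512 pages, that a whole free
pageblock is converted at once, and that each conversion costs exactly
one TLB flush. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGES_PER_BLOCK 512 /* assumed pageblock size in pages */
#define SKETCH_NR_BLOCKS 4

struct sketch_pageblock {
	bool unmapped;     /* models the freetype flag stored on the block */
	size_t free_pages; /* pages still free in this block */
};

struct sketch_zone {
	struct sketch_pageblock blocks[SKETCH_NR_BLOCKS];
	unsigned long tlb_flushes; /* counts modelled TLB flushes */
};

static void sketch_zone_init(struct sketch_zone *zone)
{
	for (size_t i = 0; i < SKETCH_NR_BLOCKS; i++) {
		zone->blocks[i].unmapped = false;
		zone->blocks[i].free_pages = SKETCH_PAGES_PER_BLOCK;
	}
	zone->tlb_flushes = 0;
}

/* Returns the index of the block the page came from, or -1 on failure. */
static int sketch_alloc_unmapped_page(struct sketch_zone *zone)
{
	/* Fast path: a block that is already unmapped needs no flush. */
	for (int i = 0; i < SKETCH_NR_BLOCKS; i++) {
		struct sketch_pageblock *b = &zone->blocks[i];

		if (b->unmapped && b->free_pages > 0) {
			b->free_pages--;
			return i;
		}
	}
	/* Slow path: convert a fully free mapped block, paying one flush. */
	for (int i = 0; i < SKETCH_NR_BLOCKS; i++) {
		struct sketch_pageblock *b = &zone->blocks[i];

		if (!b->unmapped && b->free_pages == SKETCH_PAGES_PER_BLOCK) {
			b->unmapped = true;
			zone->tlb_flushes++;
			b->free_pages--;
			return i;
		}
	}
	return -1;
}

/* Demo helper: flushes needed for nr_allocs allocations from a fresh zone. */
static unsigned long sketch_flushes_for(size_t nr_allocs)
{
	struct sketch_zone zone;

	sketch_zone_init(&zone);
	for (size_t i = 0; i < nr_allocs; i++) {
		int block = sketch_alloc_unmapped_page(&zone);

		assert(block >= 0); /* callers must not exceed zone capacity */
		(void)block;
	}
	return zone.tlb_flushes;
}
```

In this model 512 single-page allocations cost one flush in total, and
the 513th allocation triggers the second.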

.:::: Design: Introducing the "mermap"

Sharp readers might by now be asking how __GFP_UNMAPPED interacts with
__GFP_ZERO. If pages aren't in the direct map, how can the page
allocator zero them? The solution is the "mermap", short for "epheMERal
mapping". The mermap provides an efficient way to temporarily map pages
into the local address space, and the allocator uses these mappings to
zero pages.
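
The zeroing flow can be modelled in userspace roughly as follows. This
is only a sketch: the slot structure and the sketch_mermap_* helpers are
hypothetical stand-ins, not the mermap API from the series, and an array
plays the role of physical memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096
#define SKETCH_NR_PAGES 8

/* Backing storage standing in for physical memory. */
static unsigned char sketch_phys[SKETCH_NR_PAGES][SKETCH_PAGE_SIZE];
/* Whether each page is reachable through the modelled direct map. */
static bool sketch_in_direct_map[SKETCH_NR_PAGES];

/* One ephemeral mapping slot, standing in for an mm-local address. */
struct sketch_mermap_slot {
	unsigned char *addr; /* NULL while nothing is mapped */
};

static unsigned char *sketch_mermap_map(struct sketch_mermap_slot *slot,
					size_t pfn)
{
	assert(pfn < SKETCH_NR_PAGES);
	assert(slot->addr == NULL); /* slot must not already be in use */
	slot->addr = sketch_phys[pfn];
	return slot->addr;
}

static void sketch_mermap_unmap(struct sketch_mermap_slot *slot)
{
	/* Assumption: real code would also have to handle stale TLB entries. */
	slot->addr = NULL;
}

/* Zero a page that is absent from the direct map via a temporary mapping. */
static void sketch_zero_unmapped_page(struct sketch_mermap_slot *slot,
				      size_t pfn)
{
	unsigned char *va;

	assert(pfn < SKETCH_NR_PAGES);
	assert(!sketch_in_direct_map[pfn]);
	va = sketch_mermap_map(slot, pfn);
	memset(va, 0, SKETCH_PAGE_SIZE);
	sketch_mermap_unmap(slot);
}

/* Demo helper: dirty a page, zero it, report whether it ended up clean. */
static bool sketch_demo_zeroing(size_t pfn)
{
	struct sketch_mermap_slot slot = { .addr = NULL };

	assert(pfn < SKETCH_NR_PAGES);
	memset(sketch_phys[pfn], 0xAA, SKETCH_PAGE_SIZE);
	sketch_in_direct_map[pfn] = false;
	sketch_zero_unmapped_page(&slot, pfn);
	for (size_t i = 0; i < SKETCH_PAGE_SIZE; i++)
		if (sketch_phys[pfn][i] != 0)
			return false;
	/* The temporary mapping must be gone once zeroing is done. */
	return slot.addr == NULL;
}
```

The point of the model is that the page is only addressable between the
map and unmap calls, and only through the caller's own slot.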

Using the mermap securely requires some knowledge about the usage of the
pages. One slightly awkward part of this design is that the page
allocator's usage of the mermap then "leaks" out so that callers who
allocate with __GFP_UNMAPPED|__GFP_ZERO need to be aware of the mermap's
security implications. For the guest_memfd unmapping usecase, that means
when guest_memfd.c makes these special allocations, it is only safe
because the pages will belong to the current process. In other words,
the use of the mermap potentially allows that process to leak the pages
via CPU sidechannels (unless more holistic/expensive mitigations are
enabled).

Since this cover letter is already too long I won't describe most
details of the mermap here, please see the patch that introduces it.

But one key detail is that it requires a kernel-space but mm-local
virtual address region. So... this series adds that too (for x86). This
is called the mm-local region and is implemented by "just" extending and
generalising the LDT remap area.

.:: Outline of the patchset

- Patches  1 ->  2 introduce the mm-local region for x86

- Patches  3 ->  5 are prep patches for the mermap

- Patches  6 ->  7 introduce the mermap

- Patches  8 -> 11 introduce freetypes

  - Patch 8 in particular is the big annoying switch-over which changes
    a whole bunch of code from "migratetype" to "freetype". In order to
    try and have the compiler help out with catching bugs, this is done
    with an annoying typedef. I'm sorry that this patch is so annoying,
    but I think if we do want to extend the allocator along these lines
    then a typedef + big annoying patch is probably the safest way.

- Patches 12 -> 21 introduce __GFP_UNMAPPED

- Patch 22 adopts __GFP_UNMAPPED for secretmem.

.:: Hacky bits

.:::: Hacky bits: mermap pagetable management

As discussed in [8], I can't find any existing pagetable management code
that serves the needs of the mermap. I have responded to this with some
extensions to __apply_to_page_range(). Those extensions make it serve
the mermap's needs re avoiding allocation and locking. But they don't
offer an efficient way to preallocate empty pagetables in
mermap_mm_prepare(). For now I think the inefficient way is fine but if a
workload emerges that is sensitive to the latency of mermap setup, it
will need to be optimised. I think that would best be done as part of
the general pagetable library I hope to discuss in [8].

.:::: Hacky bits: simplistic secretmem integration

The secretmem integration leaves the main optimisations on the table;
the security-required flushes of the mermap areas are implemented via
distinct tlb_flush_mm() calls. It should be possible to amortize the
mermap TLB flushes completely into the normal VMA flushing. However, as
far as I know there is no performance-sensitive usecase for secretmem.
So, I've just implemented the minimal adoption. This will at least avoid
fragmentation of the direct map, even if it doesn't reduce TLB flushing.
If anyone knows of a workload that might benefit from dropping that
flushing, let me know!

.:: Performance

In [4] is a branch containing: 

1. This series.

2. All the key kernel patches from the Firecracker team's "secret-free"
   effort, which includes guest_memfd unmapping ([0]).

3. Some prototype patches to switch guest_memfd over from an ad-hoc
   unmapping logic to use of __GFP_UNMAPPED (plus direct use of the
   mermap to implement write()).

I benchmarked this using Firecracker's own performance tests [4], which
measure the time required to populate the VM guest's memory. This
population happens via write() so it exercises the mermap. I ran this on
a Sapphire Rapids machine [5]. The baseline here is just the secret-free
patches on their own. "gfp_unmapped" is the branch described above.
"skip-flush" provides a reference against an implementation that just
skips flushing the TLB when unmapping guest_memfd pages, which serves as
an upper-bound on performance.

metric: populate_latency (ms)   |  test: firecracker-perf-tests-wrapped
+---------------+---------+----------+----------+------------------------+----------+--------+
| nixos_variant | samples |     mean |      min | histogram              |      max | Δμ     |
+---------------+---------+----------+----------+------------------------+----------+--------+
|               |      30 |    1.04s |    1.02s |                     █  |    1.10s |        |
| gfp_unmapped  |      30 | 313.02ms | 299.48ms |       █                | 343.25ms | -70.0% |
| skip-flush    |      30 | 325.80ms | 307.91ms |       █                | 333.30ms | -68.8% |
+---------------+---------+----------+----------+------------------------+----------+--------+

Conclusion: it's close to the best-case performance for this particular
workload. (Note that in the sample above the gfp_unmapped mean is
actually faster than skip-flush - that's noise, this isn't a consistent
observation).
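
For readers checking the table: the Δμ column is presumably the relative
change in mean latency against the baseline row (an assumption, the
benchmark tooling is not described here). A minimal sketch of that
arithmetic, with a hypothetical helper name:

```c
#include <assert.h>

/* Relative change of value against baseline, in percent. */
static double sketch_relative_change_percent(double baseline, double value)
{
	assert(baseline != 0.0);
	return (value - baseline) / baseline * 100.0;
}
```

The baseline mean is only printed to three significant figures (1.04s),
so feeding the printed values into this formula reproduces the Δμ column
only to within about 0.1 percentage points.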

[0] [PATCH v10 00/15] Direct Map Removal Support for guest_memfd
    https://lore.kernel.org/all/20260126164445.11867-1-kalyazin@amazon.com/

[1] https://linuxasi.dev/

[2] https://lpc.events/event/19/contributions/2095/

[3] https://lore.kernel.org/lkml/20250313-asi-page-alloc-v1-0-04972e046cea@google.com/

[4] https://github.com/bjackman/kernel-benchmarks-nix/blob/fd56c93344760927b71161368230a15741a5869f/packages/benchmarks/firecracker-perf-tests/firecracker-perf-tests.sh

[5] https://github.com/bjackman/aethelred/blob/eb0dd0e99ee08fa0534733113e93b89499affe91

[6] https://github.com/bjackman/kernel-benchmarks-nix/blob/924a5cb3215360e4620eafcd7e00eb4e69e2ee93/packages/benchmarks/stress-ng/default.nix#L18

[7] https://lore.kernel.org/lkml/20230308094106.227365-1-rppt@kernel.org/

[8] https://lore.kernel.org/all/20260219175113.618562-1-jackmanb@google.com/

[9] https://lore.kernel.org/lkml/20260319-gfp64-v1-0-2c73b8d42b7f@google.com/

Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: x86@kernel.org
Cc: rppt@kernel.org
Cc: Sumit Garg <sumit.garg@oss.qualcomm.com>
To: Borislav Petkov <bp@alien8.de>
To: Dave Hansen <dave.hansen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>
To: Andrew Morton <akpm@linux-foundation.org>
To: David Hildenbrand <david@kernel.org>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Vlastimil Babka <vbabka@kernel.org>
To: Mike Rapoport <rppt@kernel.org>
To: Wei Xu <weixugc@google.com>
To: Johannes Weiner <hannes@cmpxchg.org>
To: Zi Yan <ziy@nvidia.com>
Cc: yosryahmed@google.com
Cc: derkling@google.com
Cc: reijiw@google.com
Cc: Will Deacon <will@kernel.org>
Cc: rientjes@google.com
Cc: "Kalyazin, Nikita" <kalyazin@amazon.co.uk>
Cc: patrick.roy@linux.dev
Cc: "Itazuri, Takahiro" <itazur@amazon.co.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: David Kaplan <david.kaplan@amd.com>
Cc: Thomas Gleixner <tglx@kernel.org>

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
Changes in v2:
- Adopted it for secretmem
- Fixed mm-local region under KPTI on PAE
- Added __apply_to_get_range() mess
- Added fault injection for mermap_get(), fixed bug this uncovered
- Fixed various silly bugs
- Link to v1 (RFC): https://lore.kernel.org/r/20260225-page_alloc-unmapped-v1-0-e8808a03cd66@google.com

---
Brendan Jackman (22):
      x86/mm: split out preallocate_sub_pgd()
      x86/mm: Generalize LDT remap into "mm-local region"
      x86/tlb: Expose some flush function declarations to modules
      mm: Create flags arg for __apply_to_page_range()
      mm: Add more flags for __apply_to_page_range()
      x86/mm: introduce the mermap
      mm: KUnit tests for the mermap
      mm: introduce for_each_free_list()
      mm/page_alloc: don't overload migratetype in find_suitable_fallback()
      mm: introduce freetype_t
      mm: move migratetype definitions to freetype.h
      mm: add definitions for allocating unmapped pages
      mm: rejig pageblock mask definitions
      mm: encode freetype flags in pageblock flags
      mm/page_alloc: remove ifdefs from pindex helpers
      mm/page_alloc: separate pcplists by freetype flags
      mm/page_alloc: rename ALLOC_NON_BLOCK back to _HARDER
      mm/page_alloc: introduce ALLOC_NOBLOCK
      mm/page_alloc: implement __GFP_UNMAPPED allocations
      mm/page_alloc: implement __GFP_UNMAPPED|__GFP_ZERO allocations
      mm: Minimal KUnit tests for some new page_alloc logic
      mm/secretmem: Use __GFP_UNMAPPED when available

 Documentation/arch/x86/x86_64/mm.rst    |   4 +-
 arch/x86/Kconfig                        |   3 +
 arch/x86/include/asm/mermap.h           |  23 +
 arch/x86/include/asm/mmu_context.h      | 119 ++++-
 arch/x86/include/asm/page.h             |  32 ++
 arch/x86/include/asm/pgalloc.h          |  33 ++
 arch/x86/include/asm/pgtable_32_areas.h |   9 +-
 arch/x86/include/asm/pgtable_64_types.h |  18 +-
 arch/x86/include/asm/pgtable_types.h    |   2 +
 arch/x86/include/asm/tlbflush.h         |  43 +-
 arch/x86/kernel/ldt.c                   | 130 +-----
 arch/x86/mm/init_64.c                   |  44 +-
 arch/x86/mm/pgtable.c                   |  32 +-
 include/linux/freetype.h                | 147 ++++++
 include/linux/gfp.h                     |  25 +-
 include/linux/gfp_types.h               |  26 ++
 include/linux/mermap.h                  |  63 +++
 include/linux/mermap_types.h            |  44 ++
 include/linux/mm.h                      |  13 +
 include/linux/mm_types.h                |   6 +
 include/linux/mmzone.h                  |  84 ++--
 include/linux/pageblock-flags.h         |  16 +-
 include/trace/events/mmflags.h          |   9 +-
 kernel/fork.c                           |   6 +
 kernel/panic.c                          |   2 +
 kernel/power/snapshot.c                 |   8 +-
 mm/Kconfig                              |  43 ++
 mm/Makefile                             |   3 +
 mm/compaction.c                         |  36 +-
 mm/init-mm.c                            |   3 +
 mm/internal.h                           |  70 ++-
 mm/memory.c                             |  76 +--
 mm/mermap.c                             | 334 ++++++++++++++
 mm/mm_init.c                            |  11 +-
 mm/page_alloc.c                         | 788 +++++++++++++++++++++++---------
 mm/page_isolation.c                     |   2 +-
 mm/page_owner.c                         |   7 +-
 mm/page_reporting.c                     |   4 +-
 mm/pgalloc-track.h                      |   6 +
 mm/secretmem.c                          |  87 +++-
 mm/show_mem.c                           |   4 +-
 mm/tests/mermap_kunit.c                 | 250 ++++++++++
 mm/tests/page_alloc_kunit.c             | 250 ++++++++++
 43 files changed, 2341 insertions(+), 574 deletions(-)
---
base-commit: b5d083a3ed1e2798396d5e491432e887da8d4a06
change-id: 20260112-page_alloc-unmapped-944fe5d7b55c

Best regards,
-- 
Brendan Jackman <jackmanb@google.com>




Thread overview: 33+ messages
2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 01/22] x86/mm: split out preallocate_sub_pgd() Brendan Jackman
2026-03-20 19:42   ` Dave Hansen
2026-03-23 11:01     ` Brendan Jackman
2026-03-24 15:27   ` Borislav Petkov
2026-03-25 13:28     ` Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 02/22] x86/mm: Generalize LDT remap into "mm-local region" Brendan Jackman
2026-03-20 19:47   ` Dave Hansen
2026-03-23 12:01     ` Brendan Jackman
2026-03-23 12:57       ` Brendan Jackman
2026-03-25 14:23   ` Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 03/22] x86/tlb: Expose some flush function declarations to modules Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 04/22] mm: Create flags arg for __apply_to_page_range() Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 05/22] mm: Add more flags " Brendan Jackman
2026-03-26 16:14   ` Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 06/22] x86/mm: introduce the mermap Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 07/22] mm: KUnit tests for " Brendan Jackman
2026-03-24  8:00   ` kernel test robot
2026-03-20 18:23 ` [PATCH v2 08/22] mm: introduce for_each_free_list() Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 09/22] mm/page_alloc: don't overload migratetype in find_suitable_fallback() Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 10/22] mm: introduce freetype_t Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 11/22] mm: move migratetype definitions to freetype.h Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 12/22] mm: add definitions for allocating unmapped pages Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 13/22] mm: rejig pageblock mask definitions Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 14/22] mm: encode freetype flags in pageblock flags Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 15/22] mm/page_alloc: remove ifdefs from pindex helpers Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 16/22] mm/page_alloc: separate pcplists by freetype flags Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 17/22] mm/page_alloc: rename ALLOC_NON_BLOCK back to _HARDER Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 18/22] mm/page_alloc: introduce ALLOC_NOBLOCK Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 19/22] mm/page_alloc: implement __GFP_UNMAPPED allocations Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 20/22] mm/page_alloc: implement __GFP_UNMAPPED|__GFP_ZERO allocations Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 21/22] mm: Minimal KUnit tests for some new page_alloc logic Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 22/22] mm/secretmem: Use __GFP_UNMAPPED when available Brendan Jackman
