From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
To: Brendan Jackman <jackmanb@google.com>,
Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
Peter Zijlstra <peterz@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>, Wei Xu <weixugc@google.com>,
Johannes Weiner <hannes@cmpxchg.org>, Zi Yan <ziy@nvidia.com>,
Lorenzo Stoakes <ljs@kernel.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, x86@kernel.org,
rppt@kernel.org, Sumit Garg <sumit.garg@oss.qualcomm.com>,
derkling@google.com, reijiw@google.com,
Will Deacon <will@kernel.org>,
rientjes@google.com, "Kalyazin, Nikita" <kalyazin@amazon.co.uk>,
patrick.roy@linux.dev, "Itazuri, Takahiro" <itazur@amazon.co.uk>,
Andy Lutomirski <luto@kernel.org>,
David Kaplan <david.kaplan@amd.com>,
Thomas Gleixner <tglx@kernel.org>, Yosry Ahmed <yosry@kernel.org>
Subject: Re: [PATCH v2 19/22] mm/page_alloc: implement __GFP_UNMAPPED allocations
Date: Wed, 13 May 2026 17:43:53 +0200
Message-ID: <7bfda0d8-2a7a-4337-8b55-d0c158df7839@kernel.org>
In-Reply-To: <20260320-page_alloc-unmapped-v2-19-28bf1bd54f41@google.com>
On 3/20/26 19:23, Brendan Jackman wrote:
> Currently __GFP_UNMAPPED allocs will always fail because, although the
> lists exist to hold them, there is no way to actually create an unmapped
> page block. This commit adds one, and also the logic to map it back
> again when that's needed.
>
> Doing this at pageblock granularity ensures that the pageblock flags can
> be used to infer which freetype a page belongs to. It also provides nice
> batching of TLB flushes, and avoids creating too much unnecessary
> TLB fragmentation in the physmap.
>
> There are some functional requirements for flipping a block:
>
> - Unmapping requires a TLB shootdown, meaning IRQs must be enabled.
>
> - Because the main usecase of this feature is to protect against CPU
> exploits, when a block is mapped it needs to be zeroed to ensure no
> residual data is available to attackers. Zeroing a block with a
> spinlock held seems undesirable.
Did I overlook something, or does this patch not do the whole block zeroing?
Or is it handled by set_direct_map_valid_noflush() itself?
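I.e. I'd have expected something like this in __rmqueue_direct_map(), after
the direct map has been made valid again (just a sketch of what I mean, not
something the patch contains):

	if (want_mapped) {
		/* wipe residual data before re-exposing the block */
		for (int i = 0; i < nr_pageblocks << pageblock_order; i++)
			clear_highpage(page + i);
	}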
> - Updating the pagetables might require allocating a pagetable to break
> down a huge page. This would deadlock if the zone lock was held.
>
> This makes allocations that need to change sensitivity _somewhat_
> similar to those that need to fall back to a different migratetype. But
> the locking requirements mean that this can't just be squashed into the
> existing "fallback" allocator logic; instead a new allocator path just
> for this purpose is needed.
>
> The new path is assumed to be much cheaper than the really heavyweight
> stuff like compaction and reclaim. But at present it is treated as less
Uhh, speaking of compaction and reclaim... we rely on finding a whole free
pageblock in order to flip it. If one doesn't exist, the whole
get_page_from_freelist() will fail, and we might enter the
reclaim/compaction cycle in __alloc_pages_slowpath(). But since we might
ultimately want an order-0 allocation, no compaction will be attempted,
because that code won't know we failed to flip a pageblock. And the
watermarks might look good and prevent reclaim as well, I think? We should
somehow indicate this and handle it accordingly. Might not be trivial.
Or maybe reuse the pageblock isolation code to do the migrations directly in
__rmqueue_direct_map()?
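For the first option, a completely untested sketch of the direction I mean,
in __alloc_pages_slowpath() ("flip_failed" is an invented field, not
something in this series):

	/*
	 * The freelists had no whole pageblock we could flip, so compact
	 * at pageblock_order even though the request itself is order-0.
	 */
	if (!page && unlikely(ac->flip_failed))
		page = __alloc_pages_direct_compact(gfp_mask, pageblock_order,
						    alloc_flags, ac,
						    INIT_COMPACT_PRIORITY,
						    &compact_result);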
> desirable than the mobility-related "fallback" and "stealing" logic.
> This might turn out to need revision (in particular, maybe it's a
> problem that __rmqueue_steal(), which causes fragmentation, happens
> before __rmqueue_direct_map()), but that should be treated as a subsequent
> optimisation project.
>
> This currently forbids __GFP_ZERO just to keep the patch from getting
> too large; the next patch will remove this restriction.
>
> Signed-off-by: Brendan Jackman <jackmanb@google.com>
> ---
> include/linux/gfp.h | 11 +++-
> mm/Kconfig | 4 +-
> mm/page_alloc.c | 171 ++++++++++++++++++++++++++++++++++++++++++++++++----
> 3 files changed, 170 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 34a38c420e84a..2d8279c6300d3 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -24,6 +24,7 @@ struct mempolicy;
> static inline freetype_t gfp_freetype(const gfp_t gfp_flags)
> {
> int migratetype;
> + unsigned int ft_flags = 0;
>
> VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
> BUILD_BUG_ON((1UL << GFP_MOVABLE_SHIFT) != ___GFP_MOVABLE);
> @@ -40,7 +41,15 @@ static inline freetype_t gfp_freetype(const gfp_t gfp_flags)
> >> GFP_MOVABLE_SHIFT;
> }
>
> - return migrate_to_freetype(migratetype, 0);
> +#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
> + if (gfp_flags & __GFP_UNMAPPED) {
> + if (WARN_ON_ONCE(migratetype != MIGRATE_UNMOVABLE))
> + migratetype = MIGRATE_UNMOVABLE;
> + ft_flags |= FREETYPE_UNMAPPED;
> + }
> +#endif
> +
> + return migrate_to_freetype(migratetype, ft_flags);
> }
> #undef GFP_MOVABLE_MASK
> #undef GFP_MOVABLE_SHIFT
> diff --git a/mm/Kconfig b/mm/Kconfig
> index b915af74d33cc..e4cb52149acad 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1505,8 +1505,8 @@ config MERMAP_KUNIT_TEST
>
> If unsure, say N.
>
> -endmenu
> -
> config PAGE_ALLOC_UNMAPPED
> bool "Support allocating pages that aren't in the direct map" if COMPILE_TEST
> default COMPILE_TEST
> +
> +endmenu
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 83d06a6db6433..710ee9f46d467 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -34,6 +34,7 @@
> #include <linux/folio_batch.h>
> #include <linux/memory_hotplug.h>
> #include <linux/nodemask.h>
> +#include <linux/set_memory.h>
> #include <linux/vmstat.h>
> #include <linux/fault-inject.h>
> #include <linux/compaction.h>
> @@ -1002,6 +1003,26 @@ static void change_pageblock_range(struct page *pageblock_page,
> }
> }
>
> +/*
> + * Can pages of these two freetypes be combined into a single higher-order free
> + * page?
> + */
> +static inline bool can_merge_freetypes(freetype_t a, freetype_t b)
> +{
> + if (freetypes_equal(a, b))
> + return true;
> +
> + if (!migratetype_is_mergeable(free_to_migratetype(a)) ||
> + !migratetype_is_mergeable(free_to_migratetype(b)))
> + return false;
> +
> + /*
> + * Mustn't "just" merge pages with different freetype flags; changing
> + * those requires updating pagetables.
> + */
> + return freetype_flags(a) == freetype_flags(b);
> +}
> +
> /*
> * Freeing function for a buddy system allocator.
> *
> @@ -1070,9 +1091,7 @@ static inline void __free_one_page(struct page *page,
> buddy_ft = get_pfnblock_freetype(buddy, buddy_pfn);
> buddy_mt = free_to_migratetype(buddy_ft);
>
> - if (migratetype != buddy_mt &&
> - (!migratetype_is_mergeable(migratetype) ||
> - !migratetype_is_mergeable(buddy_mt)))
> + if (!can_merge_freetypes(freetype, buddy_ft))
> goto done_merging;
> }
>
> @@ -1089,7 +1108,9 @@ static inline void __free_one_page(struct page *page,
> /*
> * Match buddy type. This ensures that an
> * expand() down the line puts the sub-blocks
> - * on the right freelists.
> + * on the right freelists. Freetype flags are
> + * already set correctly because of
> + * can_merge_freetypes().
> */
> change_pageblock_range(buddy, order, migratetype);
> }
> @@ -1982,6 +2003,9 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
> struct free_area *area;
> struct page *page;
>
> + if (freetype_idx(freetype) < 0)
> + return NULL;
> +
> /* Find a page of the appropriate size in the preferred list */
> for (current_order = order; current_order < NR_PAGE_ORDERS; ++current_order) {
> enum migratetype migratetype = free_to_migratetype(freetype);
> @@ -3324,6 +3348,119 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
> #endif
> }
>
> +#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
> +/* Try to allocate a page by mapping/unmapping a block from the direct map. */
> +static inline struct page *
> +__rmqueue_direct_map(struct zone *zone, unsigned int request_order,
> + unsigned int alloc_flags, freetype_t freetype)
> +{
> + unsigned int ft_flags_other = freetype_flags(freetype) ^ FREETYPE_UNMAPPED;
> + freetype_t ft_other = migrate_to_freetype(free_to_migratetype(freetype),
> + ft_flags_other);
> + bool want_mapped = !(freetype_flags(freetype) & FREETYPE_UNMAPPED);
> + enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
Why not RMQUEUE_CLAIM? We want to change the migratetype to ours as well,
not just the unmapped flag?
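I.e., if I'm reading the fallback logic right, just:

	enum rmqueue_mode rmqm = RMQUEUE_CLAIM;

so that __rmqueue() claims the whole block for our migratetype before we
flip it.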
> + unsigned long irq_flags;
> + int nr_pageblocks;
> + struct page *page;
> + int alloc_order;
> + int err;
> +
> + if (freetype_idx(ft_other) < 0)
> + return NULL;
> +
> + /*
> + * Might need a TLB shootdown. Even if IRQs are on this isn't
> + * safe if the caller holds a lock (in case the other CPUs need that
> + * lock to handle the shootdown IPI).
> + */
> + if (alloc_flags & ALLOC_NOBLOCK)
> + return NULL;
> +
> + if (!can_set_direct_map())
> + return NULL;
> +
> + lockdep_assert(!irqs_disabled() || unlikely(early_boot_irqs_disabled));
> +
> + /*
> + * Need to [un]map a whole pageblock (otherwise it might require
> + * allocating pagetables). First allocate it.
> + */
> + alloc_order = max(request_order, pageblock_order);
> + nr_pageblocks = 1 << (alloc_order - pageblock_order);
> + zone_lock_irqsave(zone, irq_flags);
> + page = __rmqueue(zone, alloc_order, ft_other, alloc_flags, &rmqm);
> + zone_unlock_irqrestore(zone, irq_flags);
> + if (!page)
> + return NULL;
> +
> + /*
> + * Now that IRQs are on it's safe to do a TLB shootdown, and now that we
> + * released the zone lock it's possible to allocate a pagetable if
> + * needed to split up a huge page.
> + *
> + * Note that modifying the direct map may need to allocate pagetables.
> + * What about unbounded recursion? Here are the assumptions that make it
> + * safe:
> + *
> + * - The direct map starts out fully mapped at boot. (This is not really
> + * an "assumption", as it's in direct control of page_alloc.c.)
> + *
> + * - Once pages in the direct map are broken down, they are not
> + * re-aggregated into larger pages again.
> + *
> + * - Pagetables are never allocated with __GFP_UNMAPPED.
> + *
> + * Under these assumptions, a pagetable might need to be allocated while
> + * _unmapping_ stuff from the direct map during a __GFP_UNMAPPED
> + * allocation. But, the allocation of that pagetable never requires
> + * allocating a further pagetable.
> + */
> + err = set_direct_map_valid_noflush(page,
> + nr_pageblocks << pageblock_order, want_mapped);
> + if (err == -ENOMEM || WARN_ONCE(err, "err=%d\n", err)) {
> + zone_lock_irqsave(zone, irq_flags);
> + __free_one_page(page, page_to_pfn(page), zone,
> + alloc_order, freetype, FPI_SKIP_REPORT_NOTIFY);
> + zone_unlock_irqrestore(zone, irq_flags);
> + return NULL;
> + }
> +
> + if (!want_mapped) {
> + unsigned long start = (unsigned long)page_address(page);
> + unsigned long end = start + (nr_pageblocks << (pageblock_order + PAGE_SHIFT));
> +
> + flush_tlb_kernel_range(start, end);
> + }
> +
> + for (int i = 0; i < nr_pageblocks; i++) {
> + struct page *block_page = page + (pageblock_nr_pages * i);
> +
> + set_pageblock_freetype_flags(block_page, freetype_flags(freetype));
> + }
> +
> + if (request_order >= alloc_order)
> + return page;
> +
> + /* Free any remaining pages in the block. */
> + zone_lock_irqsave(zone, irq_flags);
> + for (unsigned int i = request_order; i < alloc_order; i++) {
> + struct page *page_to_free = page + (1 << i);
> +
> + __free_one_page(page_to_free, page_to_pfn(page_to_free), zone,
> + i, freetype, FPI_SKIP_REPORT_NOTIFY);
> + }
Could expand() be used here?
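I.e., assuming the expand() in this series takes a freetype_t, something
like (untested):

	/* put the unused tail of the block back on the free lists */
	expand(zone, page, request_order, alloc_order, freetype);

under the same zone lock, instead of open-coding the __free_one_page() loop.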
> + zone_unlock_irqrestore(zone, irq_flags);
> +
> + return page;
> +}
> +#else /* CONFIG_PAGE_ALLOC_UNMAPPED */
> +static inline struct page *__rmqueue_direct_map(struct zone *zone, unsigned int request_order,
> + unsigned int alloc_flags, freetype_t freetype)
> +{
> + return NULL;
> +}
> +#endif /* CONFIG_PAGE_ALLOC_UNMAPPED */
> +
> static __always_inline
> struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
> unsigned int order, unsigned int alloc_flags,
> @@ -3331,8 +3468,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
> {
> struct page *page;
> unsigned long flags;
> - freetype_t ft_high = freetype_with_migrate(freetype,
> - MIGRATE_HIGHATOMIC);
> + freetype_t ft_high = freetype_with_migrate(freetype, MIGRATE_HIGHATOMIC);
>
> do {
> page = NULL;
> @@ -3357,13 +3493,15 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
> */
> if (!page && (alloc_flags & (ALLOC_OOM|ALLOC_HARDER)))
> page = __rmqueue_smallest(zone, order, ft_high);
> -
> - if (!page) {
> - zone_unlock_irqrestore(zone, flags);
> - return NULL;
> - }
> }
> zone_unlock_irqrestore(zone, flags);
> +
> + /* Try changing direct map, now we've released the zone lock */
> + if (!page)
> + page = __rmqueue_direct_map(zone, order, alloc_flags, freetype);
> + if (!page)
> + return NULL;
> +
> } while (check_new_pages(page, order));
>
> __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
> @@ -3587,6 +3725,8 @@ static void reserve_highatomic_pageblock(struct page *page, int order,
> static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> bool force)
> {
> + freetype_t ft_high = freetype_with_migrate(ac->freetype,
> + MIGRATE_HIGHATOMIC);
> struct zonelist *zonelist = ac->zonelist;
> unsigned long flags;
> struct zoneref *z;
> @@ -3595,6 +3735,9 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> int order;
> int ret;
>
> + if (freetype_idx(ft_high) < 0)
> + return false;
> +
> for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->highest_zoneidx,
> ac->nodemask) {
> /*
> @@ -3608,8 +3751,6 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
> zone_lock_irqsave(zone, flags);
> for (order = 0; order < NR_PAGE_ORDERS; order++) {
> struct free_area *area = &(zone->free_area[order]);
> - freetype_t ft_high = freetype_with_migrate(ac->freetype,
> - MIGRATE_HIGHATOMIC);
> unsigned long size;
>
> page = get_page_from_free_area(area, ft_high);
> @@ -5109,6 +5250,10 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
> ac->nodemask = nodemask;
> ac->freetype = gfp_freetype(gfp_mask);
>
> + /* Not implemented yet. */
> + if (freetype_flags(ac->freetype) & FREETYPE_UNMAPPED && gfp_mask & __GFP_ZERO)
> + return false;
> +
> if (cpusets_enabled()) {
> *alloc_gfp |= __GFP_HARDWALL;
> /*
>