The Linux Kernel Mailing List
From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
To: Brendan Jackman <jackmanb@google.com>,
	Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@kernel.org>, Wei Xu <weixugc@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>, Zi Yan <ziy@nvidia.com>,
	Lorenzo Stoakes <ljs@kernel.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, x86@kernel.org,
	rppt@kernel.org, Sumit Garg <sumit.garg@oss.qualcomm.com>,
	derkling@google.com, reijiw@google.com,
	Will Deacon <will@kernel.org>,
	rientjes@google.com, "Kalyazin, Nikita" <kalyazin@amazon.co.uk>,
	patrick.roy@linux.dev, "Itazuri, Takahiro" <itazur@amazon.co.uk>,
	Andy Lutomirski <luto@kernel.org>,
	David Kaplan <david.kaplan@amd.com>,
	Thomas Gleixner <tglx@kernel.org>, Yosry Ahmed <yosry@kernel.org>
Subject: Re: [PATCH v2 19/22] mm/page_alloc: implement __GFP_UNMAPPED allocations
Date: Wed, 13 May 2026 17:43:53 +0200
Message-ID: <7bfda0d8-2a7a-4337-8b55-d0c158df7839@kernel.org>
In-Reply-To: <20260320-page_alloc-unmapped-v2-19-28bf1bd54f41@google.com>

On 3/20/26 19:23, Brendan Jackman wrote:
> Currently __GFP_UNMAPPED allocs will always fail because, although the
> lists exist to hold them, there is no way to actually create an unmapped
> page block. This commit adds one, and also the logic to map it back
> again when that's needed.
> 
> Doing this at pageblock granularity ensures that the pageblock flags can
> be used to infer which freetype a page belongs to. It also provides nice
> batching of TLB flushes, and avoids creating too much unnecessary TLB
> fragmentation in the physmap.
> 
> There are some functional requirements for flipping a block:
> 
>  - Unmapping requires a TLB shootdown, meaning IRQs must be enabled.
> 
>  - Because the main use case of this feature is to protect against CPU
>    exploits, when a block is mapped it needs to be zeroed to ensure no
>    residual data is available to attackers. Zeroing a block with a
>    spinlock held seems undesirable.

Did I overlook something, or does this patch not do this whole-block zeroing?
Or is it handled by set_direct_map_valid_noflush() itself?
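
FWIW what I was expecting is something like the below in
__rmqueue_direct_map(), after the set_direct_map_valid_noflush() call but
before the pages become reachable again. Completely untested sketch, the
loop itself is my assumption, only the identifiers are from this patch:

	if (want_mapped) {
		unsigned long nr = nr_pageblocks << pageblock_order;
		unsigned long i;

		/* Wipe residual data before exposing the block. */
		for (i = 0; i < nr; i++)
			clear_page(page_address(page + i));
	}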

>  - Updating the pagetables might require allocating a pagetable to break
>    down a huge page. This would deadlock if the zone lock was held.
> 
> This makes allocations that need to change sensitivity _somewhat_
> similar to those that need to fall back to a different migratetype. But
> the locking requirements mean that this can't just be squashed into the
> existing "fallback" allocator logic; instead, a new allocator path just
> for this purpose is needed.
> 
> The new path is assumed to be much cheaper than the really heavyweight
> stuff like compaction and reclaim. But at present it is treated as less

Uhh, speaking of compaction and reclaim... we rely on finding a whole free
pageblock in order to flip it. If that doesn't exist, the whole
get_page_from_freelist() will fail, and we might enter the
reclaim/compaction cycle in __alloc_pages_slowpath(). But since we might
ultimately want an order-0 allocation, there won't be any compaction
attempted, because that code won't know we failed to flip a pageblock. And
the watermarks might look good and prevent reclaim as well, I think? We
should somehow indicate this and handle it accordingly. Might not be
trivial. Or maybe reuse the pageblock isolation code to do the migrations
directly in __rmqueue_direct_map()?
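
To illustrate the first idea: the plumbing below is entirely hypothetical
(the flag name and its value included), just to show what "indicate and
handle accordingly" could mean. __alloc_pages_direct_compact() and
INIT_COMPACT_PRIORITY are the existing slowpath helpers:

#define ALLOC_FLIP_FAILED	0x1000	/* hypothetical: set when a flip fails */

	/* then in __alloc_pages_slowpath(), even for an order-0 request: */
	if (alloc_flags & ALLOC_FLIP_FAILED)
		page = __alloc_pages_direct_compact(gfp_mask, pageblock_order,
						    alloc_flags, ac,
						    INIT_COMPACT_PRIORITY,
						    &compact_result);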

> desirable than the mobility-related "fallback" and "stealing" logic.
> This might turn out to need revision (in particular, maybe it's a
> problem that __rmqueue_steal(), which causes fragmentation, happens
> before __rmqueue_direct_map()), but that should be treated as a subsequent
> optimisation project.
> 
> This currently forbids __GFP_ZERO; this is just to keep the patch from
> getting too large. The next patch will remove this restriction.
> 
> Signed-off-by: Brendan Jackman <jackmanb@google.com>
> ---
>  include/linux/gfp.h |  11 +++-
>  mm/Kconfig          |   4 +-
>  mm/page_alloc.c     | 171 ++++++++++++++++++++++++++++++++++++++++++++++++----
>  3 files changed, 170 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 34a38c420e84a..2d8279c6300d3 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -24,6 +24,7 @@ struct mempolicy;
>  static inline freetype_t gfp_freetype(const gfp_t gfp_flags)
>  {
>  	int migratetype;
> +	unsigned int ft_flags = 0;
>  
>  	VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
>  	BUILD_BUG_ON((1UL << GFP_MOVABLE_SHIFT) != ___GFP_MOVABLE);
> @@ -40,7 +41,15 @@ static inline freetype_t gfp_freetype(const gfp_t gfp_flags)
>  			>> GFP_MOVABLE_SHIFT;
>  	}
>  
> -	return migrate_to_freetype(migratetype, 0);
> +#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
> +	if (gfp_flags & __GFP_UNMAPPED) {
> +		if (WARN_ON_ONCE(migratetype != MIGRATE_UNMOVABLE))
> +			migratetype = MIGRATE_UNMOVABLE;
> +		ft_flags |= FREETYPE_UNMAPPED;
> +	}
> +#endif
> +
> +	return migrate_to_freetype(migratetype, ft_flags);
>  }
>  #undef GFP_MOVABLE_MASK
>  #undef GFP_MOVABLE_SHIFT
> diff --git a/mm/Kconfig b/mm/Kconfig
> index b915af74d33cc..e4cb52149acad 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1505,8 +1505,8 @@ config MERMAP_KUNIT_TEST
>  
>  	  If unsure, say N.
>  
> -endmenu
> -
>  config PAGE_ALLOC_UNMAPPED
>  	bool "Support allocating pages that aren't in the direct map" if COMPILE_TEST
>  	default COMPILE_TEST
> +
> +endmenu
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 83d06a6db6433..710ee9f46d467 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -34,6 +34,7 @@
>  #include <linux/folio_batch.h>
>  #include <linux/memory_hotplug.h>
>  #include <linux/nodemask.h>
> +#include <linux/set_memory.h>
>  #include <linux/vmstat.h>
>  #include <linux/fault-inject.h>
>  #include <linux/compaction.h>
> @@ -1002,6 +1003,26 @@ static void change_pageblock_range(struct page *pageblock_page,
>  	}
>  }
>  
> +/*
> + * Can pages of these two freetypes be combined into a single higher-order free
> + * page?
> + */
> +static inline bool can_merge_freetypes(freetype_t a, freetype_t b)
> +{
> +	if (freetypes_equal(a, b))
> +		return true;
> +
> +	if (!migratetype_is_mergeable(free_to_migratetype(a)) ||
> +	    !migratetype_is_mergeable(free_to_migratetype(b)))
> +		return false;
> +
> +	/*
> +	 * Mustn't "just" merge pages with different freetype flags, changing
> +	 * those requires updating pagetables.
> +	 */
> +	return freetype_flags(a) == freetype_flags(b);
> +}
> +
>  /*
>   * Freeing function for a buddy system allocator.
>   *
> @@ -1070,9 +1091,7 @@ static inline void __free_one_page(struct page *page,
>  			buddy_ft = get_pfnblock_freetype(buddy, buddy_pfn);
>  			buddy_mt = free_to_migratetype(buddy_ft);
>  
> -			if (migratetype != buddy_mt &&
> -			    (!migratetype_is_mergeable(migratetype) ||
> -			     !migratetype_is_mergeable(buddy_mt)))
> +			if (!can_merge_freetypes(freetype, buddy_ft))
>  				goto done_merging;
>  		}
>  
> @@ -1089,7 +1108,9 @@ static inline void __free_one_page(struct page *page,
>  			/*
>  			 * Match buddy type. This ensures that an
>  			 * expand() down the line puts the sub-blocks
> -			 * on the right freelists.
> +			 * on the right freelists. Freetype flags are
> +			 * already set correctly because of
> +			 * can_merge_freetypes().
>  			 */
>  			change_pageblock_range(buddy, order, migratetype);
>  		}
> @@ -1982,6 +2003,9 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>  	struct free_area *area;
>  	struct page *page;
>  
> +	if (freetype_idx(freetype) < 0)
> +		return NULL;
> +
>  	/* Find a page of the appropriate size in the preferred list */
>  	for (current_order = order; current_order < NR_PAGE_ORDERS; ++current_order) {
>  		enum migratetype migratetype = free_to_migratetype(freetype);
> @@ -3324,6 +3348,119 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
>  #endif
>  }
>  
> +#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
> +/* Try to allocate a page by mapping/unmapping a block from the direct map. */
> +static inline struct page *
> +__rmqueue_direct_map(struct zone *zone, unsigned int request_order,
> +		     unsigned int alloc_flags, freetype_t freetype)
> +{
> +	unsigned int ft_flags_other = freetype_flags(freetype) ^ FREETYPE_UNMAPPED;
> +	freetype_t ft_other = migrate_to_freetype(free_to_migratetype(freetype),
> +						  ft_flags_other);
> +	bool want_mapped = !(freetype_flags(freetype) & FREETYPE_UNMAPPED);
> +	enum rmqueue_mode rmqm = RMQUEUE_NORMAL;

Why not RMQUEUE_CLAIM? We want to change the migratetype to ours as well,
not just the unmapped flag?
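
I.e. something like (untested):

	enum rmqueue_mode rmqm = RMQUEUE_CLAIM;

so that __rmqueue() is also allowed to claim the whole block's migratetype
for us on the way.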

> +	unsigned long irq_flags;
> +	int nr_pageblocks;
> +	struct page *page;
> +	int alloc_order;
> +	int err;
> +
> +	if (freetype_idx(ft_other) < 0)
> +		return NULL;
> +
> +	/*
> +	 * Might need a TLB shootdown. Even if IRQs are on this isn't
> +	 * safe if the caller holds a lock (in case the other CPUs need that
> +	 * lock to handle the shootdown IPI).
> +	 */
> +	if (alloc_flags & ALLOC_NOBLOCK)
> +		return NULL;
> +
> +	if (!can_set_direct_map())
> +		return NULL;
> +
> +	lockdep_assert(!irqs_disabled() || unlikely(early_boot_irqs_disabled));
> +
> +	/*
> +	 * Need to [un]map a whole pageblock (otherwise it might require
> +	 * allocating pagetables). First allocate it.
> +	 */
> +	alloc_order = max(request_order, pageblock_order);
> +	nr_pageblocks = 1 << (alloc_order - pageblock_order);
> +	zone_lock_irqsave(zone, irq_flags);
> +	page = __rmqueue(zone, alloc_order, ft_other, alloc_flags, &rmqm);
> +	zone_unlock_irqrestore(zone, irq_flags);
> +	if (!page)
> +		return NULL;
> +
> +	/*
> +	 * Now that IRQs are on it's safe to do a TLB shootdown, and now that we
> +	 * released the zone lock it's possible to allocate a pagetable if
> +	 * needed to split up a huge page.
> +	 *
> +	 * Note that modifying the direct map may need to allocate pagetables.
> +	 * What about unbounded recursion? Here are the assumptions that make it
> +	 * safe:
> +	 *
> +	 * - The direct map starts out fully mapped at boot. (This is not really
> +	 *   an "assumption" as it's in direct control of page_alloc.c).
> +	 *
> +	 * - Once pages in the direct map are broken down, they are not
> +	 *   re-aggregated into larger pages again.
> +	 *
> +	 * - Pagetables are never allocated with __GFP_UNMAPPED.
> +	 *
> +	 * Under these assumptions, a pagetable might need to be allocated while
> +	 * _unmapping_ stuff from the direct map during a __GFP_UNMAPPED
> +	 * allocation. But, the allocation of that pagetable never requires
> +	 * allocating a further pagetable.
> +	 */
> +	err = set_direct_map_valid_noflush(page,
> +				nr_pageblocks << pageblock_order, want_mapped);
> +	if (err == -ENOMEM || WARN_ONCE(err, "err=%d\n", err)) {
> +		zone_lock_irqsave(zone, irq_flags);
> +		__free_one_page(page, page_to_pfn(page), zone,
> +				alloc_order, freetype, FPI_SKIP_REPORT_NOTIFY);
> +		zone_unlock_irqrestore(zone, irq_flags);
> +		return NULL;
> +	}
> +
> +	if (!want_mapped) {
> +		unsigned long start = (unsigned long)page_address(page);
> +		unsigned long end = start + (nr_pageblocks << (pageblock_order + PAGE_SHIFT));
> +
> +		flush_tlb_kernel_range(start, end);
> +	}
> +
> +	for (int i = 0; i < nr_pageblocks; i++) {
> +		struct page *block_page = page + (pageblock_nr_pages * i);
> +
> +		set_pageblock_freetype_flags(block_page, freetype_flags(freetype));
> +	}
> +
> +	if (request_order >= alloc_order)
> +		return page;
> +
> +	/* Free any remaining pages in the block. */
> +	zone_lock_irqsave(zone, irq_flags);
> +	for (unsigned int i = request_order; i < alloc_order; i++) {
> +		struct page *page_to_free = page + (1 << i);
> +
> +		__free_one_page(page_to_free, page_to_pfn(page_to_free), zone,
> +			i, freetype, FPI_SKIP_REPORT_NOTIFY);
> +	}

Could expand() be used here?
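
I.e. something like (untested, assuming expand() has been converted to
take a freetype_t by this point in the series, and modulo the freepage
accounting that its callers normally do):

	zone_lock_irqsave(zone, irq_flags);
	expand(zone, page, request_order, alloc_order, freetype);
	zone_unlock_irqrestore(zone, irq_flags);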

> +	zone_unlock_irqrestore(zone, irq_flags);
> +
> +	return page;
> +}
> +#else /* CONFIG_PAGE_ALLOC_UNMAPPED */
> +static inline struct page *__rmqueue_direct_map(struct zone *zone, unsigned int request_order,
> +				unsigned int alloc_flags, freetype_t freetype)
> +{
> +	return NULL;
> +}
> +#endif /* CONFIG_PAGE_ALLOC_UNMAPPED */
> +
>  static __always_inline
>  struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
>  			   unsigned int order, unsigned int alloc_flags,
> @@ -3331,8 +3468,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
>  {
>  	struct page *page;
>  	unsigned long flags;
> -	freetype_t ft_high = freetype_with_migrate(freetype,
> -						       MIGRATE_HIGHATOMIC);
> +	freetype_t ft_high = freetype_with_migrate(freetype, MIGRATE_HIGHATOMIC);
>  
>  	do {
>  		page = NULL;
> @@ -3357,13 +3493,15 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
>  			 */
>  			if (!page && (alloc_flags & (ALLOC_OOM|ALLOC_HARDER)))
>  				page = __rmqueue_smallest(zone, order, ft_high);
> -
> -			if (!page) {
> -				zone_unlock_irqrestore(zone, flags);
> -				return NULL;
> -			}
>  		}
>  		zone_unlock_irqrestore(zone, flags);
> +
> +		/* Try changing direct map, now we've released the zone lock */
> +		if (!page)
> +			page = __rmqueue_direct_map(zone, order, alloc_flags, freetype);
> +		if (!page)
> +			return NULL;
> +
>  	} while (check_new_pages(page, order));
>  
>  	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
> @@ -3587,6 +3725,8 @@ static void reserve_highatomic_pageblock(struct page *page, int order,
>  static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
>  						bool force)
>  {
> +	freetype_t ft_high = freetype_with_migrate(ac->freetype,
> +					MIGRATE_HIGHATOMIC);
>  	struct zonelist *zonelist = ac->zonelist;
>  	unsigned long flags;
>  	struct zoneref *z;
> @@ -3595,6 +3735,9 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
>  	int order;
>  	int ret;
>  
> +	if (freetype_idx(ft_high) < 0)
> +		return false;
> +
>  	for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->highest_zoneidx,
>  								ac->nodemask) {
>  		/*
> @@ -3608,8 +3751,6 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
>  		zone_lock_irqsave(zone, flags);
>  		for (order = 0; order < NR_PAGE_ORDERS; order++) {
>  			struct free_area *area = &(zone->free_area[order]);
> -			freetype_t ft_high = freetype_with_migrate(ac->freetype,
> -							MIGRATE_HIGHATOMIC);
>  			unsigned long size;
>  
>  			page = get_page_from_free_area(area, ft_high);
> @@ -5109,6 +5250,10 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
>  	ac->nodemask = nodemask;
>  	ac->freetype = gfp_freetype(gfp_mask);
>  
> +	/* Not implemented yet. */
> +	if (freetype_flags(ac->freetype) & FREETYPE_UNMAPPED && gfp_mask & __GFP_ZERO)
> +		return false;
> +
>  	if (cpusets_enabled()) {
>  		*alloc_gfp |= __GFP_HARDWALL;
>  		/*
> 


