Message-ID: <7bfda0d8-2a7a-4337-8b55-d0c158df7839@kernel.org>
Date: Wed, 13 May 2026 17:43:53 +0200
From: "Vlastimil Babka (SUSE)"
Subject: Re: [PATCH v2 19/22] mm/page_alloc: implement __GFP_UNMAPPED allocations
To: Brendan Jackman, Borislav Petkov, Dave Hansen, Peter Zijlstra,
 Andrew Morton, David Hildenbrand, Wei Xu, Johannes Weiner, Zi Yan,
 Lorenzo Stoakes
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, x86@kernel.org,
 rppt@kernel.org, Sumit Garg, derkling@google.com, reijiw@google.com,
 Will Deacon, rientjes@google.com, "Kalyazin, Nikita",
 patrick.roy@linux.dev, "Itazuri, Takahiro", Andy Lutomirski,
 David Kaplan, Thomas Gleixner, Yosry Ahmed
References: <20260320-page_alloc-unmapped-v2-0-28bf1bd54f41@google.com>
 <20260320-page_alloc-unmapped-v2-19-28bf1bd54f41@google.com>
In-Reply-To: <20260320-page_alloc-unmapped-v2-19-28bf1bd54f41@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 3/20/26 19:23, Brendan Jackman wrote:
> Currently __GFP_UNMAPPED allocs will always fail because, although the
> lists exist to hold them, there is no way to actually create an unmapped
> page block. This commit adds one, and also the logic to map it back
> again when that's needed.
> 
> Doing this at pageblock granularity ensures that the pageblock flags can
> be used to infer which freetype a page belongs to. It also provides nice
> batching of TLB flushes, and also avoids creating too much unnecessary
> TLB fragmentation in the physmap.
> 
> There are some functional requirements for flipping a block:
> 
> - Unmapping requires a TLB shootdown, meaning IRQs must be enabled.
> 
> - Because the main usecase of this feature is to protect against CPU
>   exploits, when a block is mapped it needs to be zeroed to ensure no
>   residual data is available to attackers. Zeroing a block with a
>   spinlock held seems undesirable.

Did I overlook something, or does this patch not do this whole-block
zeroing? Or is it handled by set_direct_map_valid_noflush() itself?

> - Updating the pagetables might require allocating a pagetable to break
>   down a huge page. This would deadlock if the zone lock was held.
> 
> This makes allocations that need to change sensitivity _somewhat_
> similar to those that need to fallback to a different migratetype. But,
> the locking requirements mean that this can't just be squashed into the
> existing "fallback" allocator logic, instead a new allocator path just
> for this purpose is needed.
> 
> The new path is assumed to be much cheaper than the really heavyweight
> stuff like compaction and reclaim. But at present it is treated as less

Uhh, speaking of compaction and reclaim... we rely on finding a whole
free pageblock in order to flip it. If one doesn't exist, the whole
get_page_from_freelist() will fail, and we might enter the
reclaim/compaction cycle in __alloc_pages_slowpath(). But since we might
ultimately want an order-0 allocation, there won't be any compaction
attempted, because that code won't know we failed to flip a pageblock.
And I think the watermarks might look good and prevent reclaim as well?
We should somehow indicate this and handle it accordingly; that might
not be trivial. Or maybe reuse the pageblock isolation code to do the
migrations directly in __rmqueue_direct_map()?

> desirable than the mobility-related "fallback" and "stealing" logic.
> This might turn out to need revision (in particular, maybe it's a
> problem that __rmqueue_steal(), which causes fragmentation, happens
> before __rmqueue_direct_map()), but that should be treated as a subsequent
> optimisation project.
> 
> This currently forbids __GFP_ZERO, this is just to keep the patch from
> getting too large, the next patch will remove this restriction.
> 
> Signed-off-by: Brendan Jackman
> ---
>  include/linux/gfp.h |  11 +++-
>  mm/Kconfig          |   4 +-
>  mm/page_alloc.c     | 171 ++++++++++++++++++++++++++++++++++++++++++++++++----
>  3 files changed, 170 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 34a38c420e84a..2d8279c6300d3 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -24,6 +24,7 @@ struct mempolicy;
>  static inline freetype_t gfp_freetype(const gfp_t gfp_flags)
>  {
>  	int migratetype;
> +	unsigned int ft_flags = 0;
>  
>  	VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
>  	BUILD_BUG_ON((1UL << GFP_MOVABLE_SHIFT) != ___GFP_MOVABLE);
> @@ -40,7 +41,15 @@ static inline freetype_t gfp_freetype(const gfp_t gfp_flags)
>  			>> GFP_MOVABLE_SHIFT;
>  	}
>  
> -	return migrate_to_freetype(migratetype, 0);
> +#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
> +	if (gfp_flags & __GFP_UNMAPPED) {
> +		if (WARN_ON_ONCE(migratetype != MIGRATE_UNMOVABLE))
> +			migratetype = MIGRATE_UNMOVABLE;
> +		ft_flags |= FREETYPE_UNMAPPED;
> +	}
> +#endif
> +
> +	return migrate_to_freetype(migratetype, ft_flags);
>  }
>  #undef GFP_MOVABLE_MASK
>  #undef GFP_MOVABLE_SHIFT
> diff --git a/mm/Kconfig b/mm/Kconfig
> index b915af74d33cc..e4cb52149acad 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1505,8 +1505,8 @@ config MERMAP_KUNIT_TEST
>  
>  	  If unsure, say N.
> 
> -endmenu
> -
>  config PAGE_ALLOC_UNMAPPED
>  	bool "Support allocating pages that aren't in the direct map" if COMPILE_TEST
>  	default COMPILE_TEST
> +
> +endmenu
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 83d06a6db6433..710ee9f46d467 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -34,6 +34,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>  #include
> @@ -1002,6 +1003,26 @@ static void change_pageblock_range(struct page *pageblock_page,
>  	}
>  }
>  
> +/*
> + * Can pages of these two freetypes be combined into a single higher-order free
> + * page?
> + */
> +static inline bool can_merge_freetypes(freetype_t a, freetype_t b)
> +{
> +	if (freetypes_equal(a, b))
> +		return true;
> +
> +	if (!migratetype_is_mergeable(free_to_migratetype(a)) ||
> +	    !migratetype_is_mergeable(free_to_migratetype(b)))
> +		return false;
> +
> +	/*
> +	 * Mustn't "just" merge pages with different freetype flags, changing
> +	 * those requires updating pagetables.
> +	 */
> +	return freetype_flags(a) == freetype_flags(b);
> +}
> +
>  /*
>   * Freeing function for a buddy system allocator.
>   *
> @@ -1070,9 +1091,7 @@ static inline void __free_one_page(struct page *page,
>  		buddy_ft = get_pfnblock_freetype(buddy, buddy_pfn);
>  		buddy_mt = free_to_migratetype(buddy_ft);
>  
> -		if (migratetype != buddy_mt &&
> -		    (!migratetype_is_mergeable(migratetype) ||
> -		     !migratetype_is_mergeable(buddy_mt)))
> +		if (!can_merge_freetypes(freetype, buddy_ft))
>  			goto done_merging;
>  	}
>  
> @@ -1089,7 +1108,9 @@ static inline void __free_one_page(struct page *page,
>  		/*
>  		 * Match buddy type. This ensures that an
>  		 * expand() down the line puts the sub-blocks
> -		 * on the right freelists.
> +		 * on the right freelists. Freetype flags are
> +		 * already set correctly because of
> +		 * can_merge_freetypes().
>  		 */
>  		change_pageblock_range(buddy, order, migratetype);
>  	}
> @@ -1982,6 +2003,9 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>  	struct free_area *area;
>  	struct page *page;
>  
> +	if (freetype_idx(freetype) < 0)
> +		return NULL;
> +
>  	/* Find a page of the appropriate size in the preferred list */
>  	for (current_order = order; current_order < NR_PAGE_ORDERS; ++current_order) {
>  		enum migratetype migratetype = free_to_migratetype(freetype);
> @@ -3324,6 +3348,119 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
>  #endif
>  }
>  
> +#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
> +/* Try to allocate a page by mapping/unmapping a block from the direct map. */
> +static inline struct page *
> +__rmqueue_direct_map(struct zone *zone, unsigned int request_order,
> +		     unsigned int alloc_flags, freetype_t freetype)
> +{
> +	unsigned int ft_flags_other = freetype_flags(freetype) ^ FREETYPE_UNMAPPED;
> +	freetype_t ft_other = migrate_to_freetype(free_to_migratetype(freetype),
> +						  ft_flags_other);
> +	bool want_mapped = !(freetype_flags(freetype) & FREETYPE_UNMAPPED);
> +	enum rmqueue_mode rmqm = RMQUEUE_NORMAL;

Why not RMQUEUE_CLAIM? We want to change the migratetype to ours as
well, not just the unmapped flag?

> +	unsigned long irq_flags;
> +	int nr_pageblocks;
> +	struct page *page;
> +	int alloc_order;
> +	int err;
> +
> +	if (freetype_idx(ft_other) < 0)
> +		return NULL;
> +
> +	/*
> +	 * Might need a TLB shootdown. Even if IRQs are on this isn't
> +	 * safe if the caller holds a lock (in case the other CPUs need that
> +	 * lock to handle the shootdown IPI).
> +	 */
> +	if (alloc_flags & ALLOC_NOBLOCK)
> +		return NULL;
> +
> +	if (!can_set_direct_map())
> +		return NULL;
> +
> +	lockdep_assert(!irqs_disabled() || unlikely(early_boot_irqs_disabled));
> +
> +	/*
> +	 * Need to [un]map a whole pageblock (otherwise it might require
> +	 * allocating pagetables). First allocate it.
> +	 */
> +	alloc_order = max(request_order, pageblock_order);
> +	nr_pageblocks = 1 << (alloc_order - pageblock_order);
> +	zone_lock_irqsave(zone, irq_flags);
> +	page = __rmqueue(zone, alloc_order, ft_other, alloc_flags, &rmqm);
> +	zone_unlock_irqrestore(zone, irq_flags);
> +	if (!page)
> +		return NULL;
> +
> +	/*
> +	 * Now that IRQs are on it's safe to do a TLB shootdown, and now that we
> +	 * released the zone lock it's possible to allocate a pagetable if
> +	 * needed to split up a huge page.
> +	 *
> +	 * Note that modifying the direct map may need to allocate pagetables.
> +	 * What about unbounded recursion? Here are the assumptions that make it
> +	 * safe:
> +	 *
> +	 * - The direct map starts out fully mapped at boot. (This is not really
> +	 *   an assumption" as its in direct control of page_alloc.c).
> +	 *
> +	 * - Once pages in the direct map are broken down, they are not
> +	 *   re-aggregated into larger pages again.
> +	 *
> +	 * - Pagetables are never allocated with __GFP_UNMAPPED.
> +	 *
> +	 * Under these assumptions, a pagetable might need to be allocated while
> +	 * _unmapping_ stuff from the direct map during a __GFP_UNMAPPED
> +	 * allocation. But, the allocation of that pagetable never requires
> +	 * allocating a further pagetable.
> +	 */
> +	err = set_direct_map_valid_noflush(page,
> +			nr_pageblocks << pageblock_order, want_mapped);
> +	if (err == -ENOMEM || WARN_ONCE(err, "err=%d\n", err)) {
> +		zone_lock_irqsave(zone, irq_flags);
> +		__free_one_page(page, page_to_pfn(page), zone,
> +				alloc_order, freetype, FPI_SKIP_REPORT_NOTIFY);
> +		zone_unlock_irqrestore(zone, irq_flags);
> +		return NULL;
> +	}
> +
> +	if (!want_mapped) {
> +		unsigned long start = (unsigned long)page_address(page);
> +		unsigned long end = start + (nr_pageblocks << (pageblock_order + PAGE_SHIFT));
> +
> +		flush_tlb_kernel_range(start, end);
> +	}
> +
> +	for (int i = 0; i < nr_pageblocks; i++) {
> +		struct page *block_page = page + (pageblock_nr_pages * i);
> +
> +		set_pageblock_freetype_flags(block_page, freetype_flags(freetype));
> +	}
> +
> +	if (request_order >= alloc_order)
> +		return page;
> +
> +	/* Free any remaining pages in the block. */
> +	zone_lock_irqsave(zone, irq_flags);
> +	for (unsigned int i = request_order; i < alloc_order; i++) {
> +		struct page *page_to_free = page + (1 << i);
> +
> +		__free_one_page(page_to_free, page_to_pfn(page_to_free), zone,
> +				i, freetype, FPI_SKIP_REPORT_NOTIFY);
> +	}

Could expand() be used here?
> +	zone_unlock_irqrestore(zone, irq_flags);
> +
> +	return page;
> +}
> +#else /* CONFIG_PAGE_ALLOC_UNMAPPED */
> +static inline struct page *__rmqueue_direct_map(struct zone *zone, unsigned int request_order,
> +						unsigned int alloc_flags, freetype_t freetype)
> +{
> +	return NULL;
> +}
> +#endif /* CONFIG_PAGE_ALLOC_UNMAPPED */
> +
>  static __always_inline
>  struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
>  			   unsigned int order, unsigned int alloc_flags,
> @@ -3331,8 +3468,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
>  {
>  	struct page *page;
>  	unsigned long flags;
> -	freetype_t ft_high = freetype_with_migrate(freetype,
> -						   MIGRATE_HIGHATOMIC);
> +	freetype_t ft_high = freetype_with_migrate(freetype, MIGRATE_HIGHATOMIC);
>  
>  	do {
>  		page = NULL;
> @@ -3357,13 +3493,15 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
>  		 */
>  		if (!page && (alloc_flags & (ALLOC_OOM|ALLOC_HARDER)))
>  			page = __rmqueue_smallest(zone, order, ft_high);
> -
> -		if (!page) {
> -			zone_unlock_irqrestore(zone, flags);
> -			return NULL;
> -		}
>  		}
>  		zone_unlock_irqrestore(zone, flags);
> +
> +		/* Try changing direct map, now we've released the zone lock */
> +		if (!page)
> +			page = __rmqueue_direct_map(zone, order, alloc_flags, freetype);
> +		if (!page)
> +			return NULL;
> +
>  	} while (check_new_pages(page, order));
>  
>  	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
> @@ -3587,6 +3725,8 @@ static void reserve_highatomic_pageblock(struct page *page, int order,
>  static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
>  					   bool force)
>  {
> +	freetype_t ft_high = freetype_with_migrate(ac->freetype,
> +						   MIGRATE_HIGHATOMIC);
>  	struct zonelist *zonelist = ac->zonelist;
>  	unsigned long flags;
>  	struct zoneref *z;
> @@ -3595,6 +3735,9 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
>  	int order;
>  	int ret;
>  
> +	if (freetype_idx(ft_high) < 0)
> +		return false;
> +
>  	for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->highest_zoneidx,
>  					ac->nodemask) {
>  		/*
> @@ -3608,8 +3751,6 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
>  		zone_lock_irqsave(zone, flags);
>  		for (order = 0; order < NR_PAGE_ORDERS; order++) {
>  			struct free_area *area = &(zone->free_area[order]);
> -			freetype_t ft_high = freetype_with_migrate(ac->freetype,
> -								   MIGRATE_HIGHATOMIC);
>  			unsigned long size;
>  
>  			page = get_page_from_free_area(area, ft_high);
> @@ -5109,6 +5250,10 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
>  	ac->nodemask = nodemask;
>  	ac->freetype = gfp_freetype(gfp_mask);
>  
> +	/* Not implemented yet. */
> +	if (freetype_flags(ac->freetype) & FREETYPE_UNMAPPED && gfp_mask & __GFP_ZERO)
> +		return false;
> +
>  	if (cpusets_enabled()) {
>  		*alloc_gfp |= __GFP_HARDWALL;
>  		/*
>