From: Brendan Jackman <jackmanb@google.com>
To: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>,
Brendan Jackman <jackmanb@google.com>,
Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
Peter Zijlstra <peterz@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Wei Xu <weixugc@google.com>,
Johannes Weiner <hannes@cmpxchg.org>, Zi Yan <ziy@nvidia.com>,
Lorenzo Stoakes <ljs@kernel.org>
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
<x86@kernel.org>, <rppt@kernel.org>,
Sumit Garg <sumit.garg@oss.qualcomm.com>, <derkling@google.com>,
<reijiw@google.com>, Will Deacon <will@kernel.org>,
<rientjes@google.com>,
"Kalyazin, Nikita" <kalyazin@amazon.co.uk>,
<patrick.roy@linux.dev>,
"Itazuri, Takahiro" <itazur@amazon.co.uk>,
Andy Lutomirski <luto@kernel.org>,
David Kaplan <david.kaplan@amd.com>,
Thomas Gleixner <tglx@kernel.org>, Yosry Ahmed <yosry@kernel.org>
Subject: Re: [PATCH v2 19/22] mm/page_alloc: implement __GFP_UNMAPPED allocations
Date: Fri, 15 May 2026 16:46:18 +0000
Message-ID: <DIJEIELZ5DJU.26LYHOT4WR7A2@google.com>
In-Reply-To: <7bfda0d8-2a7a-4337-8b55-d0c158df7839@kernel.org>
On Wed May 13, 2026 at 3:43 PM UTC, Vlastimil Babka (SUSE) wrote:
> On 3/20/26 19:23, Brendan Jackman wrote:
>> Currently __GFP_UNMAPPED allocs will always fail because, although the
>> lists exist to hold them, there is no way to actually create an unmapped
>> page block. This commit adds one, and also the logic to map it back
>> again when that's needed.
>>
>> Doing this at pageblock granularity ensures that the pageblock flags can
>> be used to infer which freetype a page belongs to. It also provides nice
>> batching of TLB flushes, and avoids creating too much TLB
>> fragmentation in the physmap.
>>
>> There are some functional requirements for flipping a block:
>>
>> - Unmapping requires a TLB shootdown, meaning IRQs must be enabled.
>>
>> - Because the main usecase of this feature is to protect against CPU
>> exploits, when a block is mapped it needs to be zeroed to ensure no
>> residual data is available to attackers. Zeroing a block with a
>> spinlock held seems undesirable.
>
> Did I overlook something, or does this patch not do the whole block zeroing?
> Or is it handled by set_direct_map_valid_noflush() itself?
Oops. At some point I was planning to defer the zeroing to another
series. I changed my mind about that, but apparently forgot to actually
add the code back.
The code I deleted was in __rmqueue_direct_map() like this:
	if (want_mapped) {
		<zero the block>
	} else {
		unsigned long start = (unsigned long)page_address(page);
		unsigned long end = start + (nr_pageblocks << (pageblock_order + PAGE_SHIFT));

		flush_tlb_kernel_range(start, end);
	}
But actually I'm not sure that's what we want: at the moment there's a
race condition when allocating __GFP_UNMAPPED|__GFP_ZERO:
1. Take page off freelist
2. Mermap it
3. Zero it
4. Mer-unmap it
I don't know for sure, but some sort of CPU attack might be able to
exploit the gap between 2 and 3 to leak data left behind by a prior
allocation. (Like, maybe you can get the data into a uarch buffer during
the race window, then leak it afterwards at leisure.)
To mitigate that, we might want to effectively enforce
want_init_on_free() for unmapped blocks. And, if we do that, we
don't actually need to zero the block when flipping it back to mapped,
since there shouldn't be any user data in there.
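Concretely, I'm imagining something like this in the free path (an
untested sketch; both helpers are made-up names, see the comment):

	/*
	 * Untested sketch for free_pages_prepare() or wherever the
	 * want_init_on_free() handling ends up. Both helpers are
	 * made-up names: pageblock_unmapped() would test
	 * FREETYPE_UNMAPPED on the page's pageblock, and
	 * zero_pages_via_mermap() would zero through a non-direct-map
	 * alias (e.g. the mermap), since the direct-map entries for an
	 * unmapped block are invalid by this point.
	 */
	if (pageblock_unmapped(page))
		zero_pages_via_mermap(page, 1 << order);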
Any thoughts on that? I haven't tried to implement it yet, so I might be
missing something that makes it impractical. I also haven't read the
series that does zeroing through user addresses; that might have an
interesting interaction with this.
>> - Updating the pagetables might require allocating a pagetable to break
>> down a huge page. This would deadlock if the zone lock was held.
>>
>> This makes allocations that need to change sensitivity _somewhat_
>> similar to those that need to fallback to a different migratetype. But,
>> the locking requirements mean that this can't just be squashed into the
>> existing "fallback" allocator logic; instead a new allocator path just
>> for this purpose is needed.
>>
>> The new path is assumed to be much cheaper than the really heavyweight
>> stuff like compaction and reclaim. But at present it is treated as less
>
> Uhh, speaking of compaction and reclaim... we rely on finding a whole free
> pageblock in order to flip it. If that doesn't exist, the whole
> get_page_from_freelist() will fail, and we might enter the
> reclaim/compaction cycle in __alloc_pages_slowpath(). But since we might
> ultimately want an order-0 allocation, there won't be any compaction
> attempted, because that code won't know we failed to flip a pageblock. And
> the watermarks might look good and prevent reclaim as well I think? We
> should somehow indicate this, and handle accordingly. Might not be trivial.
> Or maybe reuse pageblock isolation code to do the migrations directly in
> __rmqueue_direct_map?
Ah, thanks, I suspect you are right.
I did fear there would be some sort of case where this "not-quite
reclaim" interacted badly with the actual reclaim, and I tried to test
it by running some stuff in parallel with stress-ng (allocating
__GFP_UNMAPPED via secretmem), and I didn't see a difference in the
effective availability of memory. However, I suspect testing this is
quite a deep art, and my "run these two commands that I copy-pasted from
an LLM suggestion" test was just crap.
Do you have any workloads you can suggest for evaluating this kinda
thing? We would definitely see it in Google prod (I think we see this
kind of issue with our shrinker-based internal version of ASI distorting
reclaim behaviour in ways even more subtle than this) but that is not a
very practical experimental cycle...
>>
>> +#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
>> +/* Try to allocate a page by mapping/unmapping a block from the direct map. */
>> +static inline struct page *
>> +__rmqueue_direct_map(struct zone *zone, unsigned int request_order,
>> + unsigned int alloc_flags, freetype_t freetype)
>> +{
>> + unsigned int ft_flags_other = freetype_flags(freetype) ^ FREETYPE_UNMAPPED;
>> + freetype_t ft_other = migrate_to_freetype(free_to_migratetype(freetype),
>> + ft_flags_other);
>> + bool want_mapped = !(freetype_flags(freetype) & FREETYPE_UNMAPPED);
>> + enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
>
> Why not RMQUEUE_CLAIM? We want to change the migratetype to ours as well,
> not just the unmapped flag?
Oh right, actually I think we need to do RMQUEUE_CLAIM _and_
RMQUEUE_NORMAL (or some variant of RMQUEUE_CLAIM that also supports
allocating from blocks that already have the requested migratetype).
If we switched it over to just RMQUEUE_CLAIM right now, while only one
migratetype supports FREETYPE_UNMAPPED, I think that would actually be
broken: when allocating an unmapped block (want_mapped=false) we would
always hit the freetype_idx<0 case in find_suitable_fallback().
But yeah, we do need to do RMQUEUE_CLAIM too, otherwise we'll miss
opportunities to allocate from other unmapped freetypes once those
exist.
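I.e. in __rmqueue_direct_map(), something like this untested sketch
(assuming __rmqueue() doesn't already escalate past the starting mode
on its own):

	enum rmqueue_mode rmqm = RMQUEUE_NORMAL;

	/* Try the cheap path first, then fall back to claiming a block. */
	page = __rmqueue(zone, alloc_order, ft_other, alloc_flags, &rmqm);
	if (!page) {
		rmqm = RMQUEUE_CLAIM;
		page = __rmqueue(zone, alloc_order, ft_other, alloc_flags, &rmqm);
	}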
>> + unsigned long irq_flags;
>> + int nr_pageblocks;
>> + struct page *page;
>> + int alloc_order;
>> + int err;
>> +
>> + if (freetype_idx(ft_other) < 0)
>> + return NULL;
>> +
>> + /*
>> + * Might need a TLB shootdown. Even if IRQs are on this isn't
>> + * safe if the caller holds a lock (in case the other CPUs need that
>> + * lock to handle the shootdown IPI).
>> + */
>> + if (alloc_flags & ALLOC_NOBLOCK)
>> + return NULL;
>> +
>> + if (!can_set_direct_map())
>> + return NULL;
>> +
>> + lockdep_assert(!irqs_disabled() || unlikely(early_boot_irqs_disabled));
>> +
>> + /*
>> + * Need to [un]map a whole pageblock (otherwise it might require
>> + * allocating pagetables). First allocate it.
>> + */
>> + alloc_order = max(request_order, pageblock_order);
>> + nr_pageblocks = 1 << (alloc_order - pageblock_order);
>> + zone_lock_irqsave(zone, irq_flags);
>> + page = __rmqueue(zone, alloc_order, ft_other, alloc_flags, &rmqm);
>> + zone_unlock_irqrestore(zone, irq_flags);
>> + if (!page)
>> + return NULL;
>> +
>> + /*
>> + * Now that IRQs are on it's safe to do a TLB shootdown, and now that we
>> + * released the zone lock it's possible to allocate a pagetable if
>> + * needed to split up a huge page.
>> + *
>> + * Note that modifying the direct map may need to allocate pagetables.
>> + * What about unbounded recursion? Here are the assumptions that make it
>> + * safe:
>> + *
>> + * - The direct map starts out fully mapped at boot. (This is not really
>> + * an "assumption" as it's under the direct control of page_alloc.c).
>> + *
>> + * - Once pages in the direct map are broken down, they are not
>> + * re-aggregated into larger pages again.
>> + *
>> + * - Pagetables are never allocated with __GFP_UNMAPPED.
>> + *
>> + * Under these assumptions, a pagetable might need to be allocated while
>> + * _unmapping_ stuff from the direct map during a __GFP_UNMAPPED
>> + * allocation. But, the allocation of that pagetable never requires
>> + * allocating a further pagetable.
>> + */
>> + err = set_direct_map_valid_noflush(page,
>> + nr_pageblocks << pageblock_order, want_mapped);
>> + if (err == -ENOMEM || WARN_ONCE(err, "err=%d\n", err)) {
>> + zone_lock_irqsave(zone, irq_flags);
>> + __free_one_page(page, page_to_pfn(page), zone,
>> + alloc_order, freetype, FPI_SKIP_REPORT_NOTIFY);
>> + zone_unlock_irqrestore(zone, irq_flags);
>> + return NULL;
>> + }
>> +
>> + if (!want_mapped) {
>> + unsigned long start = (unsigned long)page_address(page);
>> + unsigned long end = start + (nr_pageblocks << (pageblock_order + PAGE_SHIFT));
>> +
>> + flush_tlb_kernel_range(start, end);
>> + }
>> +
>> + for (int i = 0; i < nr_pageblocks; i++) {
>> + struct page *block_page = page + (pageblock_nr_pages * i);
>> +
>> + set_pageblock_freetype_flags(block_page, freetype_flags(freetype));
>> + }
>> +
>> + if (request_order >= alloc_order)
>> + return page;
>> +
>> + /* Free any remaining pages in the block. */
>> + zone_lock_irqsave(zone, irq_flags);
>> + for (unsigned int i = request_order; i < alloc_order; i++) {
>> + struct page *page_to_free = page + (1 << i);
>> +
>> + __free_one_page(page_to_free, page_to_pfn(page_to_free), zone,
>> + i, freetype, FPI_SKIP_REPORT_NOTIFY);
>> + }
>
> Could expand() be used here?
Hm, good point. It should probably look like what try_to_claim_block()
does...
Instead of figuring that out right now I'll just say this: if that works
I'll do it; if I find a reason why it doesn't, I'll add a comment
explaining it in the next version.
BTW, my thinking is that clarity is the only important factor here; I'm
confident that any speedup from this would disappear into the noise of
the TLB flushing etc. But if it works then yeah, I think it would
actually be clearer.
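I.e. the tail-freeing loop above would collapse to roughly this (sketch
only, assuming expand() keeps its current shape with freetype swapped in
for migratetype):

	zone_lock_irqsave(zone, irq_flags);
	/* Split down to request_order, returning tail buddies to the freelists. */
	expand(zone, page, request_order, alloc_order, freetype);
	zone_unlock_irqrestore(zone, irq_flags);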
Thanks very much for this review, I really appreciate it!