Re: [PATCH v2 19/22] mm/page_alloc: implement __GFP_UNMAPPED allocations

Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
To: Brendan Jackman <jackmanb@google.com>,
	Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@kernel.org>, Wei Xu <weixugc@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>, Zi Yan <ziy@nvidia.com>,
	Lorenzo Stoakes <ljs@kernel.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, x86@kernel.org,
	rppt@kernel.org, Sumit Garg <sumit.garg@oss.qualcomm.com>,
	derkling@google.com, reijiw@google.com,
	Will Deacon <will@kernel.org>,
	rientjes@google.com, "Kalyazin, Nikita" <kalyazin@amazon.co.uk>,
	patrick.roy@linux.dev, "Itazuri, Takahiro" <itazur@amazon.co.uk>,
	Andy Lutomirski <luto@kernel.org>,
	David Kaplan <david.kaplan@amd.com>,
	Thomas Gleixner <tglx@kernel.org>, Yosry Ahmed <yosry@kernel.org>
Subject: Re: [PATCH v2 19/22] mm/page_alloc: implement __GFP_UNMAPPED allocations
Date: Mon, 1 Jun 2026 10:59:43 +0200	[thread overview]
Message-ID: <27da4fd7-195f-4086-992e-287f79eb974b@kernel.org> (raw)
In-Reply-To: <DIJEIELZ5DJU.26LYHOT4WR7A2@google.com>

On 5/15/26 18:46, Brendan Jackman wrote:
> On Wed May 13, 2026 at 3:43 PM UTC, Vlastimil Babka (SUSE) wrote:
>> On 3/20/26 19:23, Brendan Jackman wrote:
>>> Currently __GFP_UNMAPPED allocs will always fail because, although the
>>> lists exist to hold them, there is no way to actually create an unmapped
>>> page block. This commit adds one, and also the logic to map it back
>>> again when that's needed.
>>> 
>>> Doing this at pageblock granularity ensures that the pageblock flags can
>>> be used to infer which freetype a page belongs to. It also provides nice
>>> batching of TLB flushes, and also avoids creating too much unnecessary
>>> TLB fragmentation in the physmap.
>>> 
>>> There are some functional requirements for flipping a block:
>>> 
>>>  - Unmapping requires a TLB shootdown, meaning IRQs must be enabled.
>>> 
>>>  - Because the main usecase of this feature is to protect against CPU
>>>    exploits, when a block is mapped it needs to be zeroed to ensure no
>>>    residual data is available to attackers. Zeroing a block with a
>>>    spinlock held seems undesirable.
>>
>> Did I overlook something or this patch doesn't do this whole block zeroing?
>> Or is it handled by set_direct_map_valid_noflush itself?
> 
> Oops. At some point I was planning to defer the zeroing to another
> series. I changed my mind about that but, apparently I forgot to
> actually add the code back.
> 
> The code I deleted was in __rmqueue_direct_map() like this:
> 
> 	if (want_mapped) {
> 		<zero the block>
> 	} else {
> 		unsigned long start = (unsigned long)page_address(page);
> 		unsigned long end = start + (nr_pageblocks << (pageblock_order + PAGE_SHIFT));
> 
> 		flush_tlb_kernel_range(start, end);
> 	}
> 
> But actually I'm not sure that's what we want: At the moment, there's
> actually a race condition when allocating __GFP_UNMAPPED|__GFP_ZERO:
> 
> 1. Take page off freelist
> 2. Mermap it
> 3. Zero it
> 4. Mer-unmap it
> 
> I don't know, but some sort of CPU attack might support exploiting the
> gap between 2 and 3 to leak any data left behind from a prior
> allocation. (Like, maybe you can get the data into a uarch buffer during
> the race window, then leak that data afterwards at leisure).

I can't imagine how it would work, but then novel CPU attacks might be
beyond my imagination :) But I think we can ignore hypothetical CPU attacks
for now?

> To mitigate that, we might want to effectively enforce
> want_init_on_free() for unmapped blocks. And, if we do that, we
> don't actually need to zero the block when flipping it back to mapped,
> since there shouldn't be any user data in there.
> 
> Any thoughts on that? I have not tried to implement it yet, I might be
> missing something that makes it impractical. Also I haven't read that
> series that's doing zeroing through user addresses either, this might
> have an interesting interaction with that.

I think let's try the simplest way first.

>>>  - Updating the pagetables might require allocating a pagetable to break
>>>    down a huge page. This would deadlock if the zone lock was held.
>>> 
>>> This makes allocations that need to change sensitivity _somewhat_
>>> similar to those that need to fallback to a different migratetype. But,
>>> the locking requirements mean that this can't just be squashed into the
>>> existing "fallback" allocator logic, instead a new allocator path just
>>> for this purpose is needed.
>>> 
>>> The new path is assumed to be much cheaper than the really heavyweight
>>> stuff like compaction and reclaim. But at present it is treated as less
>>
>> Uhh, speaking of compaction and reclaim... we rely on finding a whole free
>> pageblock in order to flip it. If that doesn't exist, the whole
>> get_page_from_freelist() will fail, and we might enter the
>> reclaim/compaction cycle in __allow_pages_slowpath(). But since we might
>> ultimately want an order-0 allocation, there won't be any compaction
>> attempted, because that code won't know we failed to flip a pageblock. And
>> the watermarks might look good and prevent reclaim as well I think? We
>> should somehow indicate this, and handle accordingly. Might not be trivial.
>> Or maybe reuse pageblock isolation code to do the migrations directly in
>> __rmqueue_direct_map?
> 
> Ah, thanks, I suspect you are right.
> 
> I did fear there would be some sort of case where this "not-quite
> reclaim" interacted badly with the actual reclaim, and I tried to test
> it by running some stuff in parallel with stress-ng (allocating
> __GFP_UNMAPPED via secretmem), and I didn't see a difference in the
> effective availability of memory. However, I suspect testing this is
> quite a deep art my "run these two commands that I copy pasted from an
> LLM suggestion" test was just crap.
> 
> Do you have any workloads you can suggest for evaluating this kinda
> thing? We would definitely see it in Google prod (I think we see this
> kind of issue with our shrinker-based internal version of ASI distorting
> reclaim behaviour in ways even more subtle than this) but that is not a
> very practical experimental cycle...

Your test seems a good way to start.
I realized afterwards that the solution might be something similar to how we
handle ALLOC_NOFRAGMENT.

>>>  
>>> +#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
>>> +/* Try to allocate a page by mapping/unmapping a block from the direct map. */
>>> +static inline struct page *
>>> +__rmqueue_direct_map(struct zone *zone, unsigned int request_order,
>>> +		     unsigned int alloc_flags, freetype_t freetype)
>>> +{
>>> +	unsigned int ft_flags_other = freetype_flags(freetype) ^ FREETYPE_UNMAPPED;
>>> +	freetype_t ft_other = migrate_to_freetype(free_to_migratetype(freetype),
>>> +						  ft_flags_other);
>>> +	bool want_mapped = !(freetype_flags(freetype) & FREETYPE_UNMAPPED);
>>> +	enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
>>
>> Why not RMQUEUE_CLAIM? We want to change the migratetype to ours as well,
>> not just the unmapped flag?
> 
> Oh right, actually I think we need to do RMQUEUE_CLAIM _and_
> RMQUEUE_NORMAL (or, some variant of RMQUEUE_CLAIM that also supports
> allocating from blocks that already have the requested migratetype).
> 
> If we just switch it over to just RMQUEUE_CLAIM right now, while only
> one migrateteype supports FREETYPE_UNMAPPED, I think that would actually
> be broken: When allocating an unmapped block, (want_mapped=true) we
> would always hit the freetype_idx<0 case in find_suitable_fallback().

Right.

> But yeah we do need to do RMQUEUE_CLAIM too otherwise we'll miss
> opportunities to allocate from other unmapped freetypes once those
> exist.
> 
>>> +	unsigned long irq_flags;
>>> +	int nr_pageblocks;
>>> +	struct page *page;
>>> +	int alloc_order;
>>> +	int err;
>>> +
>>> +	if (freetype_idx(ft_other) < 0)
>>> +		return NULL;
>>> +
>>> +	/*
>>> +	 * Might need a TLB shootdown. Even if IRQs are on this isn't
>>> +	 * safe if the caller holds a lock (in case the other CPUs need that
>>> +	 * lock to handle the shootdown IPI).
>>> +	 */
>>> +	if (alloc_flags & ALLOC_NOBLOCK)
>>> +		return NULL;
>>> +
>>> +	if (!can_set_direct_map())
>>> +		return NULL;
>>> +
>>> +	lockdep_assert(!irqs_disabled() || unlikely(early_boot_irqs_disabled));
>>> +
>>> +	/*
>>> +	 * Need to [un]map a whole pageblock (otherwise it might require
>>> +	 * allocating pagetables). First allocate it.
>>> +	 */
>>> +	alloc_order = max(request_order, pageblock_order);
>>> +	nr_pageblocks = 1 << (alloc_order - pageblock_order);
>>> +	zone_lock_irqsave(zone, irq_flags);
>>> +	page = __rmqueue(zone, alloc_order, ft_other, alloc_flags, &rmqm);
>>> +	zone_unlock_irqrestore(zone, irq_flags);
>>> +	if (!page)
>>> +		return NULL;
>>> +
>>> +	/*
>>> +	 * Now that IRQs are on it's safe to do a TLB shootdown, and now that we
>>> +	 * released the zone lock it's possible to allocate a pagetable if
>>> +	 * needed to split up a huge page.
>>> +	 *
>>> +	 * Note that modifying the direct map may need to allocate pagetables.
>>> +	 * What about unbounded recursion? Here are the assumptions that make it
>>> +	 * safe:
>>> +	 *
>>> +	 * - The direct map starts out fully mapped at boot. (This is not really
>>> +	 *   an assumption" as its in direct control of page_alloc.c).
>>> +	 *
>>> +	 * - Once pages in the direct map are broken down, they are not
>>> +	 *   re-aggregated into larger pages again.
>>> +	 *
>>> +	 * - Pagetables are never allocated with __GFP_UNMAPPED.
>>> +	 *
>>> +	 * Under these assumptions, a pagetable might need to be allocated while
>>> +	 * _unmapping_ stuff from the direct map during a __GFP_UNMAPPED
>>> +	 * allocation. But, the allocation of that pagetable never requires
>>> +	 * allocating a further pagetable.
>>> +	 */
>>> +	err = set_direct_map_valid_noflush(page,
>>> +				nr_pageblocks << pageblock_order, want_mapped);
>>> +	if (err == -ENOMEM || WARN_ONCE(err, "err=%d\n", err)) {
>>> +		zone_lock_irqsave(zone, irq_flags);
>>> +		__free_one_page(page, page_to_pfn(page), zone,
>>> +				alloc_order, freetype, FPI_SKIP_REPORT_NOTIFY);
>>> +		zone_unlock_irqrestore(zone, irq_flags);
>>> +		return NULL;
>>> +	}
>>> +
>>> +	if (!want_mapped) {
>>> +		unsigned long start = (unsigned long)page_address(page);
>>> +		unsigned long end = start + (nr_pageblocks << (pageblock_order + PAGE_SHIFT));
>>> +
>>> +		flush_tlb_kernel_range(start, end);
>>> +	}
>>> +
>>> +	for (int i = 0; i < nr_pageblocks; i++) {
>>> +		struct page *block_page = page + (pageblock_nr_pages * i);
>>> +
>>> +		set_pageblock_freetype_flags(block_page, freetype_flags(freetype));
>>> +	}
>>> +
>>> +	if (request_order >= alloc_order)
>>> +		return page;
>>> +
>>> +	/* Free any remaining pages in the block. */
>>> +	zone_lock_irqsave(zone, irq_flags);
>>> +	for (unsigned int i = request_order; i < alloc_order; i++) {
>>> +		struct page *page_to_free = page + (1 << i);
>>> +
>>> +		__free_one_page(page_to_free, page_to_pfn(page_to_free), zone,
>>> +			i, freetype, FPI_SKIP_REPORT_NOTIFY);
>>> +	}
>>
>> Could expand() be used here?
> 
> Hm, good point. It should probably look like what try_to_claim_block()
> does...
> 
> Instead of figuring that out right now I'll just say this: if that works
> I'll do it, if I find a reason why it doesn't I will add a comment
> explaining it in the next version.

Sounds good.

> BTW my thinking is that clarity is the only important factor here, I am
> confident that any speedup from this would disappear in the noise of the
> TLB flushing etc. But, if it works then yeah I think it would actually
> be clearer.

Sure clarity is important, but also if we have multiple functions doing
similar thing instead of sharing code, there's a risk a future change to the
more common code will miss the new one, etc.

> Thanks very much for this review, I really appreciate it!

next prev parent reply	other threads:[~2026-06-01  8:59 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-20 18:23 [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 01/22] x86/mm: split out preallocate_sub_pgd() Brendan Jackman
2026-03-20 19:42   ` Dave Hansen
2026-03-23 11:01     ` Brendan Jackman
2026-03-24 15:27   ` Borislav Petkov
2026-03-25 13:28     ` Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 02/22] x86/mm: Generalize LDT remap into "mm-local region" Brendan Jackman
2026-03-20 19:47   ` Dave Hansen
2026-03-23 12:01     ` Brendan Jackman
2026-03-23 12:57       ` Brendan Jackman
2026-03-25 14:23   ` Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 03/22] x86/tlb: Expose some flush function declarations to modules Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 04/22] mm: Create flags arg for __apply_to_page_range() Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 05/22] mm: Add more flags " Brendan Jackman
2026-03-26 16:14   ` Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 06/22] x86/mm: introduce the mermap Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 07/22] mm: KUnit tests for " Brendan Jackman
2026-03-24  8:00   ` kernel test robot
2026-03-20 18:23 ` [PATCH v2 08/22] mm: introduce for_each_free_list() Brendan Jackman
2026-05-11 13:46   ` Vlastimil Babka (SUSE)
2026-03-20 18:23 ` [PATCH v2 09/22] mm/page_alloc: don't overload migratetype in find_suitable_fallback() Brendan Jackman
2026-05-11 13:51   ` Vlastimil Babka (SUSE)
2026-05-11 16:44     ` Brendan Jackman
2026-05-11 16:53       ` Vlastimil Babka (SUSE)
2026-03-20 18:23 ` [PATCH v2 10/22] mm: introduce freetype_t Brendan Jackman
2026-05-11 15:34   ` Vlastimil Babka (SUSE)
2026-05-11 16:49     ` Brendan Jackman
2026-05-11 16:58       ` Vlastimil Babka (SUSE)
2026-05-11 18:17   ` Vlastimil Babka (SUSE)
2026-05-11 18:26   ` Vlastimil Babka (SUSE)
2026-05-18  0:00     ` Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 11/22] mm: move migratetype definitions to freetype.h Brendan Jackman
2026-05-11 15:35   ` Vlastimil Babka (SUSE)
2026-03-20 18:23 ` [PATCH v2 12/22] mm: add definitions for allocating unmapped pages Brendan Jackman
2026-05-11 18:01   ` Vlastimil Babka (SUSE)
2026-03-20 18:23 ` [PATCH v2 13/22] mm: rejig pageblock mask definitions Brendan Jackman
2026-05-11 18:07   ` Vlastimil Babka (SUSE)
2026-03-20 18:23 ` [PATCH v2 14/22] mm: encode freetype flags in pageblock flags Brendan Jackman
2026-05-11 18:29   ` Vlastimil Babka (SUSE)
2026-03-20 18:23 ` [PATCH v2 15/22] mm/page_alloc: remove ifdefs from pindex helpers Brendan Jackman
2026-05-11 18:30   ` Vlastimil Babka (SUSE)
2026-05-12  9:49     ` Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 16/22] mm/page_alloc: separate pcplists by freetype flags Brendan Jackman
2026-05-13  8:46   ` Vlastimil Babka (SUSE)
2026-03-20 18:23 ` [PATCH v2 17/22] mm/page_alloc: rename ALLOC_NON_BLOCK back to _HARDER Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 18/22] mm/page_alloc: introduce ALLOC_NOBLOCK Brendan Jackman
2026-05-13  9:43   ` Vlastimil Babka (SUSE)
2026-05-15 13:36     ` Brendan Jackman
2026-05-15 15:52       ` Gregory Price
2026-03-20 18:23 ` [PATCH v2 19/22] mm/page_alloc: implement __GFP_UNMAPPED allocations Brendan Jackman
2026-05-13 15:43   ` Vlastimil Babka (SUSE)
2026-05-15 16:46     ` Brendan Jackman
2026-05-29 15:02       ` Brendan Jackman
2026-06-01  8:50         ` Vlastimil Babka (SUSE)
2026-06-11 14:46           ` Brendan Jackman
2026-06-01  8:59       ` Vlastimil Babka (SUSE) [this message]
2026-03-20 18:23 ` [PATCH v2 20/22] mm/page_alloc: implement __GFP_UNMAPPED|__GFP_ZERO allocations Brendan Jackman
2026-05-13 17:00   ` Vlastimil Babka (SUSE)
2026-05-15 16:50     ` Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 21/22] mm: Minimal KUnit tests for some new page_alloc logic Brendan Jackman
2026-03-20 18:23 ` [PATCH v2 22/22] mm/secretmem: Use __GFP_UNMAPPED when available Brendan Jackman
2026-03-31 14:40   ` Brendan Jackman
2026-05-13 16:17 ` [PATCH v2 00/22] mm: Add __GFP_UNMAPPED Gregory Price
2026-05-13 17:14   ` Brendan Jackman
2026-05-13 17:28     ` Gregory Price
2026-05-13 17:38       ` Vlastimil Babka (SUSE)
2026-05-13 17:59         ` Gregory Price
2026-05-15  9:31           ` Brendan Jackman
2026-05-15 16:04             ` Gregory Price

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=27da4fd7-195f-4086-992e-287f79eb974b@kernel.org \
    --to=vbabka@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=david.kaplan@amd.com \
    --cc=david@kernel.org \
    --cc=derkling@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=itazur@amazon.co.uk \
    --cc=jackmanb@google.com \
    --cc=kalyazin@amazon.co.uk \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=luto@kernel.org \
    --cc=patrick.roy@linux.dev \
    --cc=peterz@infradead.org \
    --cc=reijiw@google.com \
    --cc=rientjes@google.com \
    --cc=rppt@kernel.org \
    --cc=sumit.garg@oss.qualcomm.com \
    --cc=tglx@kernel.org \
    --cc=weixugc@google.com \
    --cc=will@kernel.org \
    --cc=x86@kernel.org \
    --cc=yosry@kernel.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox