From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 15 May 2026 16:46:18 +0000
In-Reply-To: <7bfda0d8-2a7a-4337-8b55-d0c158df7839@kernel.org>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version:
1.0
References: <20260320-page_alloc-unmapped-v2-0-28bf1bd54f41@google.com>
 <20260320-page_alloc-unmapped-v2-19-28bf1bd54f41@google.com>
 <7bfda0d8-2a7a-4337-8b55-d0c158df7839@kernel.org>
X-Mailer: aerc 0.21.0
Message-ID:
Subject: Re: [PATCH v2 19/22] mm/page_alloc: implement __GFP_UNMAPPED allocations
From: Brendan Jackman
To: "Vlastimil Babka (SUSE)", Brendan Jackman, Borislav Petkov,
 Dave Hansen, Peter Zijlstra, Andrew Morton, David Hildenbrand, Wei Xu,
 Johannes Weiner, Zi Yan, Lorenzo Stoakes
Cc: Sumit Garg, Will Deacon, "Kalyazin, Nikita", "Itazuri, Takahiro",
 Andy Lutomirski, David Kaplan, Thomas Gleixner, Yosry Ahmed
Content-Type: text/plain; charset="UTF-8"

On Wed May 13, 2026 at 3:43 PM UTC, Vlastimil Babka (SUSE) wrote:
> On 3/20/26 19:23, Brendan Jackman wrote:
>> Currently __GFP_UNMAPPED allocs will always fail because, although the
>> lists exist to hold them, there is no way to actually create an unmapped
>> page block. This commit adds one, and also the logic to map it back
>> again when that's needed.
>>
>> Doing this at pageblock granularity ensures that the pageblock flags can
>> be used to infer which freetype a page belongs to. It also provides nice
>> batching of TLB flushes, and avoids creating too much unnecessary TLB
>> fragmentation in the physmap.
>>
>> There are some functional requirements for flipping a block:
>>
>> - Unmapping requires a TLB shootdown, meaning IRQs must be enabled.
>>
>> - Because the main usecase of this feature is to protect against CPU
>>   exploits, when a block is mapped it needs to be zeroed to ensure no
>>   residual data is available to attackers. Zeroing a block with a
>>   spinlock held seems undesirable.
>
> Did I overlook something, or does this patch not do the whole block
> zeroing? Or is it handled by set_direct_map_valid_noflush itself?

Oops. At some point I was planning to defer the zeroing to another
series.
I changed my mind about that, but apparently I forgot to actually add
the code back. The code I deleted was in __rmqueue_direct_map(), like
this:

	if (want_mapped) {
	} else {
		unsigned long start = (unsigned long)page_address(page);
		unsigned long end = start + (nr_pageblocks << (pageblock_order + PAGE_SHIFT));

		flush_tlb_kernel_range(start, end);
	}

But actually I'm not sure that's what we want: at the moment, there's
actually a race condition when allocating __GFP_UNMAPPED|__GFP_ZERO:

1. Take page off freelist
2. Re-map it
3. Zero it
4. Re-unmap it

I don't know, but some sort of CPU attack might be able to exploit the
gap between 2 and 3 to leak any data left behind from a prior
allocation. (Like, maybe you can get the data into a uarch buffer
during the race window, then leak that data afterwards at leisure.)

To mitigate that, we might want to effectively enforce
want_init_on_free() for unmapped blocks. And if we do that, we don't
actually need to zero the block when flipping it back to mapped, since
there shouldn't be any user data in there.

Any thoughts on that? I have not tried to implement it yet; I might be
missing something that makes it impractical. Also, I haven't read that
series that's doing zeroing through user addresses either; this might
have an interesting interaction with that.

>> - Updating the pagetables might require allocating a pagetable to break
>>   down a huge page. This would deadlock if the zone lock was held.
>>
>> This makes allocations that need to change sensitivity _somewhat_
>> similar to those that need to fall back to a different migratetype. But
>> the locking requirements mean that this can't just be squashed into the
>> existing "fallback" allocator logic; instead a new allocator path just
>> for this purpose is needed.
>>
>> The new path is assumed to be much cheaper than the really heavyweight
>> stuff like compaction and reclaim. But at present it is treated as less
>
> Uhh, speaking of compaction and reclaim...
> we rely on finding a whole free
> pageblock in order to flip it. If that doesn't exist, the whole
> get_page_from_freelist() will fail, and we might enter the
> reclaim/compaction cycle in __alloc_pages_slowpath(). But since we might
> ultimately want an order-0 allocation, there won't be any compaction
> attempted, because that code won't know we failed to flip a pageblock. And
> the watermarks might look good and prevent reclaim as well, I think? We
> should somehow indicate this, and handle accordingly. Might not be trivial.
> Or maybe reuse pageblock isolation code to do the migrations directly in
> __rmqueue_direct_map?

Ah, thanks, I suspect you are right. I did fear there would be some
sort of case where this "not-quite reclaim" interacted badly with the
actual reclaim, and I tried to test it by running some stuff in
parallel with stress-ng (allocating __GFP_UNMAPPED via secretmem), and
I didn't see a difference in the effective availability of memory.

However, I suspect testing this is quite a deep art and my "run these
two commands that I copy-pasted from an LLM suggestion" test was just
crap. Do you have any workloads you can suggest for evaluating this
kinda thing? We would definitely see it in Google prod (I think we see
this kind of issue with our shrinker-based internal version of ASI
distorting reclaim behaviour in ways even more subtle than this), but
that is not a very practical experimental cycle...

>>
>> +#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
>> +/* Try to allocate a page by mapping/unmapping a block from the direct map.
>> + */
>> +static inline struct page *
>> +__rmqueue_direct_map(struct zone *zone, unsigned int request_order,
>> +		     unsigned int alloc_flags, freetype_t freetype)
>> +{
>> +	unsigned int ft_flags_other = freetype_flags(freetype) ^ FREETYPE_UNMAPPED;
>> +	freetype_t ft_other = migrate_to_freetype(free_to_migratetype(freetype),
>> +						  ft_flags_other);
>> +	bool want_mapped = !(freetype_flags(freetype) & FREETYPE_UNMAPPED);
>> +	enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
>
> Why not RMQUEUE_CLAIM? We want to change the migratetype to ours as well,
> not just the unmapped flag?

Oh right, actually I think we need to do RMQUEUE_CLAIM _and_
RMQUEUE_NORMAL (or some variant of RMQUEUE_CLAIM that also supports
allocating from blocks that already have the requested migratetype).

If we just switched over to just RMQUEUE_CLAIM right now, while only
one migratetype supports FREETYPE_UNMAPPED, I think that would actually
be broken: when allocating an unmapped block (want_mapped=true), we
would always hit the freetype_idx<0 case in find_suitable_fallback().
But yeah, we do need to do RMQUEUE_CLAIM too, otherwise we'll miss
opportunities to allocate from other unmapped freetypes once those
exist.

>> +	unsigned long irq_flags;
>> +	int nr_pageblocks;
>> +	struct page *page;
>> +	int alloc_order;
>> +	int err;
>> +
>> +	if (freetype_idx(ft_other) < 0)
>> +		return NULL;
>> +
>> +	/*
>> +	 * Might need a TLB shootdown. Even if IRQs are on, this isn't
>> +	 * safe if the caller holds a lock (in case the other CPUs need that
>> +	 * lock to handle the shootdown IPI).
>> +	 */
>> +	if (alloc_flags & ALLOC_NOBLOCK)
>> +		return NULL;
>> +
>> +	if (!can_set_direct_map())
>> +		return NULL;
>> +
>> +	lockdep_assert(!irqs_disabled() || unlikely(early_boot_irqs_disabled));
>> +
>> +	/*
>> +	 * Need to [un]map a whole pageblock (otherwise it might require
>> +	 * allocating pagetables). First allocate it.
>> +	 */
>> +	alloc_order = max(request_order, pageblock_order);
>> +	nr_pageblocks = 1 << (alloc_order - pageblock_order);
>> +	zone_lock_irqsave(zone, irq_flags);
>> +	page = __rmqueue(zone, alloc_order, ft_other, alloc_flags, &rmqm);
>> +	zone_unlock_irqrestore(zone, irq_flags);
>> +	if (!page)
>> +		return NULL;
>> +
>> +	/*
>> +	 * Now that IRQs are on it's safe to do a TLB shootdown, and now that we
>> +	 * released the zone lock it's possible to allocate a pagetable if
>> +	 * needed to split up a huge page.
>> +	 *
>> +	 * Note that modifying the direct map may need to allocate pagetables.
>> +	 * What about unbounded recursion? Here are the assumptions that make it
>> +	 * safe:
>> +	 *
>> +	 * - The direct map starts out fully mapped at boot. (This is not really
>> +	 *   an "assumption" as it's in direct control of page_alloc.c.)
>> +	 *
>> +	 * - Once pages in the direct map are broken down, they are not
>> +	 *   re-aggregated into larger pages again.
>> +	 *
>> +	 * - Pagetables are never allocated with __GFP_UNMAPPED.
>> +	 *
>> +	 * Under these assumptions, a pagetable might need to be allocated while
>> +	 * _unmapping_ stuff from the direct map during a __GFP_UNMAPPED
>> +	 * allocation. But the allocation of that pagetable never requires
>> +	 * allocating a further pagetable.
>> +	 */
>> +	err = set_direct_map_valid_noflush(page,
>> +			nr_pageblocks << pageblock_order, want_mapped);
>> +	if (err == -ENOMEM || WARN_ONCE(err, "err=%d\n", err)) {
>> +		zone_lock_irqsave(zone, irq_flags);
>> +		__free_one_page(page, page_to_pfn(page), zone,
>> +				alloc_order, freetype, FPI_SKIP_REPORT_NOTIFY);
>> +		zone_unlock_irqrestore(zone, irq_flags);
>> +		return NULL;
>> +	}
>> +
>> +	if (!want_mapped) {
>> +		unsigned long start = (unsigned long)page_address(page);
>> +		unsigned long end = start + (nr_pageblocks << (pageblock_order + PAGE_SHIFT));
>> +
>> +		flush_tlb_kernel_range(start, end);
>> +	}
>> +
>> +	for (int i = 0; i < nr_pageblocks; i++) {
>> +		struct page *block_page = page + (pageblock_nr_pages * i);
>> +
>> +		set_pageblock_freetype_flags(block_page, freetype_flags(freetype));
>> +	}
>> +
>> +	if (request_order >= alloc_order)
>> +		return page;
>> +
>> +	/* Free any remaining pages in the block. */
>> +	zone_lock_irqsave(zone, irq_flags);
>> +	for (unsigned int i = request_order; i < alloc_order; i++) {
>> +		struct page *page_to_free = page + (1 << i);
>> +
>> +		__free_one_page(page_to_free, page_to_pfn(page_to_free), zone,
>> +				i, freetype, FPI_SKIP_REPORT_NOTIFY);
>> +	}
>
> Could expand() be used here?

Hm, good point. It should probably look like what try_to_claim_block()
does... Instead of figuring that out right now I'll just say this: if
that works I'll do it, and if I find a reason why it doesn't, I'll add
a comment explaining it in the next version.

BTW, my thinking is that clarity is the only important factor here; I
am confident that any speedup from this would disappear in the noise of
the TLB flushing etc. But if it works, then yeah, I think it would
actually be clearer.

Thanks very much for this review, I really appreciate it!