From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 15 May 2026 16:46:18 +0000
In-Reply-To: <7bfda0d8-2a7a-4337-8b55-d0c158df7839@kernel.org>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version:
1.0
References: <20260320-page_alloc-unmapped-v2-0-28bf1bd54f41@google.com>
 <20260320-page_alloc-unmapped-v2-19-28bf1bd54f41@google.com>
 <7bfda0d8-2a7a-4337-8b55-d0c158df7839@kernel.org>
X-Mailer: aerc 0.21.0
Message-ID:
Subject: Re: [PATCH v2 19/22] mm/page_alloc: implement __GFP_UNMAPPED allocations
From: Brendan Jackman
To: "Vlastimil Babka (SUSE)", Brendan Jackman, Borislav Petkov,
 Dave Hansen, Peter Zijlstra, Andrew Morton, David Hildenbrand, Wei Xu,
 Johannes Weiner, Zi Yan, Lorenzo Stoakes
Cc: Sumit Garg, Will Deacon, "Kalyazin, Nikita", "Itazuri, Takahiro",
 Andy Lutomirski, David Kaplan, Thomas Gleixner, Yosry Ahmed
Content-Type: text/plain; charset="UTF-8"

On Wed May 13, 2026 at 3:43 PM UTC, Vlastimil Babka (SUSE) wrote:
> On 3/20/26 19:23, Brendan Jackman wrote:
>> Currently __GFP_UNMAPPED allocs will always fail because, although the
>> lists exist to hold them, there is no way to actually create an unmapped
>> page block. This commit adds one, and also the logic to map it back
>> again when that's needed.
>>
>> Doing this at pageblock granularity ensures that the pageblock flags can
>> be used to infer which freetype a page belongs to. It also provides nice
>> batching of TLB flushes, and avoids creating too much unnecessary TLB
>> fragmentation in the physmap.
>>
>> There are some functional requirements for flipping a block:
>>
>> - Unmapping requires a TLB shootdown, meaning IRQs must be enabled.
>>
>> - Because the main usecase of this feature is to protect against CPU
>>   exploits, when a block is mapped it needs to be zeroed to ensure no
>>   residual data is available to attackers. Zeroing a block with a
>>   spinlock held seems undesirable.
>
> Did I overlook something, or does this patch not do the whole block
> zeroing? Or is it handled by set_direct_map_valid_noflush itself?

Oops. At some point I was planning to defer the zeroing to another
series.
I changed my mind about that, but apparently I forgot to actually add
the code back. The code I deleted was in __rmqueue_direct_map(), like
this:

	if (want_mapped) {
	} else {
		unsigned long start = (unsigned long)page_address(page);
		unsigned long end = start + (nr_pageblocks << (pageblock_order + PAGE_SHIFT));

		flush_tlb_kernel_range(start, end);
	}

But actually I'm not sure that's what we want: at the moment, there's
actually a race condition when allocating __GFP_UNMAPPED|__GFP_ZERO:

1. Take page off freelist
2. Re-map it
3. Zero it
4. Re-unmap it

I don't know, but some sort of CPU attack might be able to exploit the
gap between 2 and 3 to leak any data left behind from a prior
allocation. (Like, maybe you can get the data into a uarch buffer
during the race window, then leak that data afterwards at leisure.)

To mitigate that, we might want to effectively enforce
want_init_on_free() for unmapped blocks. And if we do that, we don't
actually need to zero the block when flipping it back to mapped, since
there shouldn't be any user data in there.

Any thoughts on that? I have not tried to implement it yet; I might be
missing something that makes it impractical. Also, I haven't read that
series that's doing zeroing through user addresses either; this might
have an interesting interaction with that.

>> - Updating the pagetables might require allocating a pagetable to break
>>   down a huge page. This would deadlock if the zone lock was held.
>>
>> This makes allocations that need to change sensitivity _somewhat_
>> similar to those that need to fall back to a different migratetype. But
>> the locking requirements mean that this can't just be squashed into the
>> existing "fallback" allocator logic; instead a new allocator path just
>> for this purpose is needed.
>>
>> The new path is assumed to be much cheaper than the really heavyweight
>> stuff like compaction and reclaim. But at present it is treated as less
>
> Uhh, speaking of compaction and reclaim...
> we rely on finding a whole free
> pageblock in order to flip it. If that doesn't exist, the whole
> get_page_from_freelist() will fail, and we might enter the
> reclaim/compaction cycle in __alloc_pages_slowpath(). But since we might
> ultimately want an order-0 allocation, there won't be any compaction
> attempted, because that code won't know we failed to flip a pageblock. And
> the watermarks might look good and prevent reclaim as well, I think? We
> should somehow indicate this, and handle accordingly. Might not be trivial.
> Or maybe reuse pageblock isolation code to do the migrations directly in
> __rmqueue_direct_map?

Ah, thanks, I suspect you are right. I did fear there would be some
sort of case where this "not-quite reclaim" interacted badly with the
actual reclaim, and I tried to test it by running some stuff in
parallel with stress-ng (allocating __GFP_UNMAPPED via secretmem), and
I didn't see a difference in the effective availability of memory.

However, I suspect testing this is quite a deep art and my "run these
two commands that I copy-pasted from an LLM suggestion" test was just
crap. Do you have any workloads you can suggest for evaluating this
kinda thing? We would definitely see it in Google prod (I think we see
this kind of issue with our shrinker-based internal version of ASI
distorting reclaim behaviour in ways even more subtle than this), but
that is not a very practical experimental cycle...

>>
>> +#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
>> +/* Try to allocate a page by mapping/unmapping a block from the direct map.
>> + */
>> +static inline struct page *
>> +__rmqueue_direct_map(struct zone *zone, unsigned int request_order,
>> +		     unsigned int alloc_flags, freetype_t freetype)
>> +{
>> +	unsigned int ft_flags_other = freetype_flags(freetype) ^ FREETYPE_UNMAPPED;
>> +	freetype_t ft_other = migrate_to_freetype(free_to_migratetype(freetype),
>> +						  ft_flags_other);
>> +	bool want_mapped = !(freetype_flags(freetype) & FREETYPE_UNMAPPED);
>> +	enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
>
> Why not RMQUEUE_CLAIM? We want to change the migratetype to ours as well,
> not just the unmapped flag?

Oh right, actually I think we need to do RMQUEUE_CLAIM _and_
RMQUEUE_NORMAL (or some variant of RMQUEUE_CLAIM that also supports
allocating from blocks that already have the requested migratetype).

If we just switched over to just RMQUEUE_CLAIM right now, while only
one migratetype supports FREETYPE_UNMAPPED, I think that would actually
be broken: when allocating an unmapped block (want_mapped=true), we
would always hit the freetype_idx<0 case in find_suitable_fallback().
But yeah, we do need to do RMQUEUE_CLAIM too, otherwise we'll miss
opportunities to allocate from other unmapped freetypes once those
exist.

>> +	unsigned long irq_flags;
>> +	int nr_pageblocks;
>> +	struct page *page;
>> +	int alloc_order;
>> +	int err;
>> +
>> +	if (freetype_idx(ft_other) < 0)
>> +		return NULL;
>> +
>> +	/*
>> +	 * Might need a TLB shootdown. Even if IRQs are on, this isn't
>> +	 * safe if the caller holds a lock (in case the other CPUs need that
>> +	 * lock to handle the shootdown IPI).
>> +	 */
>> +	if (alloc_flags & ALLOC_NOBLOCK)
>> +		return NULL;
>> +
>> +	if (!can_set_direct_map())
>> +		return NULL;
>> +
>> +	lockdep_assert(!irqs_disabled() || unlikely(early_boot_irqs_disabled));
>> +
>> +	/*
>> +	 * Need to [un]map a whole pageblock (otherwise it might require
>> +	 * allocating pagetables). First allocate it.
>> +	 */
>> +	alloc_order = max(request_order, pageblock_order);
>> +	nr_pageblocks = 1 << (alloc_order - pageblock_order);
>> +	zone_lock_irqsave(zone, irq_flags);
>> +	page = __rmqueue(zone, alloc_order, ft_other, alloc_flags, &rmqm);
>> +	zone_unlock_irqrestore(zone, irq_flags);
>> +	if (!page)
>> +		return NULL;
>> +
>> +	/*
>> +	 * Now that IRQs are on it's safe to do a TLB shootdown, and now that we
>> +	 * released the zone lock it's possible to allocate a pagetable if
>> +	 * needed to split up a huge page.
>> +	 *
>> +	 * Note that modifying the direct map may need to allocate pagetables.
>> +	 * What about unbounded recursion? Here are the assumptions that make it
>> +	 * safe:
>> +	 *
>> +	 * - The direct map starts out fully mapped at boot. (This is not really
>> +	 *   an "assumption" as it's in direct control of page_alloc.c.)
>> +	 *
>> +	 * - Once pages in the direct map are broken down, they are not
>> +	 *   re-aggregated into larger pages again.
>> +	 *
>> +	 * - Pagetables are never allocated with __GFP_UNMAPPED.
>> +	 *
>> +	 * Under these assumptions, a pagetable might need to be allocated while
>> +	 * _unmapping_ stuff from the direct map during a __GFP_UNMAPPED
>> +	 * allocation. But the allocation of that pagetable never requires
>> +	 * allocating a further pagetable.
>> +	 */
>> +	err = set_direct_map_valid_noflush(page,
>> +			nr_pageblocks << pageblock_order, want_mapped);
>> +	if (err == -ENOMEM || WARN_ONCE(err, "err=%d\n", err)) {
>> +		zone_lock_irqsave(zone, irq_flags);
>> +		__free_one_page(page, page_to_pfn(page), zone,
>> +				alloc_order, freetype, FPI_SKIP_REPORT_NOTIFY);
>> +		zone_unlock_irqrestore(zone, irq_flags);
>> +		return NULL;
>> +	}
>> +
>> +	if (!want_mapped) {
>> +		unsigned long start = (unsigned long)page_address(page);
>> +		unsigned long end = start + (nr_pageblocks << (pageblock_order + PAGE_SHIFT));
>> +
>> +		flush_tlb_kernel_range(start, end);
>> +	}
>> +
>> +	for (int i = 0; i < nr_pageblocks; i++) {
>> +		struct page *block_page = page + (pageblock_nr_pages * i);
>> +
>> +		set_pageblock_freetype_flags(block_page, freetype_flags(freetype));
>> +	}
>> +
>> +	if (request_order >= alloc_order)
>> +		return page;
>> +
>> +	/* Free any remaining pages in the block. */
>> +	zone_lock_irqsave(zone, irq_flags);
>> +	for (unsigned int i = request_order; i < alloc_order; i++) {
>> +		struct page *page_to_free = page + (1 << i);
>> +
>> +		__free_one_page(page_to_free, page_to_pfn(page_to_free), zone,
>> +				i, freetype, FPI_SKIP_REPORT_NOTIFY);
>> +	}
>
> Could expand() be used here?

Hm, good point. It should probably look like what try_to_claim_block()
does... Instead of figuring that out right now I'll just say this: if
that works I'll do it, and if I find a reason why it doesn't, I'll add
a comment explaining it in the next version.

BTW, my thinking is that clarity is the only important factor here; I
am confident that any speedup from this would disappear in the noise of
the TLB flushing etc. But if it works, then yeah, I think it would
actually be clearer.

Thanks very much for this review, I really appreciate it!