Date: Fri, 15 May 2026 16:46:18 +0000
In-Reply-To: <7bfda0d8-2a7a-4337-8b55-d0c158df7839@kernel.org>
References: <20260320-page_alloc-unmapped-v2-0-28bf1bd54f41@google.com>
 <20260320-page_alloc-unmapped-v2-19-28bf1bd54f41@google.com>
 <7bfda0d8-2a7a-4337-8b55-d0c158df7839@kernel.org>
X-Mailer: aerc 0.21.0
Subject: Re: [PATCH v2 19/22] mm/page_alloc: implement __GFP_UNMAPPED allocations
From: Brendan Jackman
To: "Vlastimil Babka (SUSE)", Brendan Jackman, Borislav Petkov,
 Dave Hansen, Peter Zijlstra, Andrew Morton, David Hildenbrand, Wei Xu,
 Johannes Weiner, Zi Yan, Lorenzo Stoakes
Cc: Sumit Garg, Will Deacon, "Kalyazin, Nikita", "Itazuri, Takahiro",
 Andy Lutomirski, David Kaplan, Thomas Gleixner, Yosry Ahmed
Content-Type: text/plain; charset="UTF-8"

On Wed May 13, 2026 at 3:43 PM UTC, Vlastimil Babka (SUSE) wrote:
> On 3/20/26 19:23, Brendan Jackman wrote:
>> Currently __GFP_UNMAPPED allocs will always fail because, although the
>> lists exist to hold them, there is no way to actually create an unmapped
>> page block. This commit adds one, and also the logic to map it back
>> again when that's needed.
>>
>> Doing this at pageblock granularity ensures that the pageblock flags can
>> be used to infer which freetype a page belongs to. It also provides nice
>> batching of TLB flushes, and also avoids creating too much unnecessary
>> TLB fragmentation in the physmap.
>>
>> There are some functional requirements for flipping a block:
>>
>> - Unmapping requires a TLB shootdown, meaning IRQs must be enabled.
>>
>> - Because the main usecase of this feature is to protect against CPU
>> exploits, when a block is mapped it needs to be zeroed to ensure no
>> residual data is available to attackers. Zeroing a block with a
>> spinlock held seems undesirable.
>
> Did I overlook something or this patch doesn't do this whole block zeroing?
> Or is it handled by set_direct_map_valid_noflush itself?

Oops. At some point I was planning to defer the zeroing to another
series. I changed my mind about that but apparently I forgot to actually
add the code back. The code I deleted was in __rmqueue_direct_map(), like
this:

	if (want_mapped) {
	} else {
		unsigned long start = (unsigned long)page_address(page);
		unsigned long end = start + (nr_pageblocks << (pageblock_order + PAGE_SHIFT));

		flush_tlb_kernel_range(start, end);
	}

But actually I'm not sure that's what we want: at the moment there's
actually a race condition when allocating __GFP_UNMAPPED|__GFP_ZERO:

1. Take page off freelist
2. Re-map it
3. Zero it
4. Re-unmap it

I don't know, but some sort of CPU attack might be able to exploit the
gap between 2 and 3 to leak any data left behind from a prior allocation.
(Like, maybe you can get the data into a uarch buffer during the race
window, then leak that data afterwards at leisure).

To mitigate that, we might want to effectively enforce want_init_on_free()
for unmapped blocks. And, if we do that, we don't actually need to zero
the block when flipping it back to mapped, since there shouldn't be any
user data in there.

Any thoughts on that? I have not tried to implement it yet; I might be
missing something that makes it impractical. (I've put a very rough
sketch of what I mean just below the quoted text.)

Also I haven't read that series that's doing zeroing through user
addresses either, this might have an interesting interaction with that.

>> - Updating the pagetables might require allocating a pagetable to break
>> down a huge page. This would deadlock if the zone lock was held.
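(Coming back to the zeroing / want_init_on_free() idea above, here is the
rough shape I have in mind. Completely untested, and
get_pageblock_unmapped() is a name I just made up for "this pageblock has
FREETYPE_UNMAPPED set"; it also leaves open how the page is still
reachable for zeroing at free time, which is maybe where that
zero-via-user-addresses series comes in.)

	/*
	 * Decide whether a page must be zeroed on free. Rough sketch only,
	 * not compile-tested; get_pageblock_unmapped() is hypothetical.
	 */
	static inline bool free_pages_want_init(struct page *page)
	{
		if (want_init_on_free())
			return true;

		/*
		 * Never leave stale data behind in an unmapped block. Then a
		 * later __GFP_UNMAPPED|__GFP_ZERO allocation doesn't need the
		 * map -> zero -> unmap dance at all, which closes the race
		 * described above.
		 */
		return get_pageblock_unmapped(page);
	}

i.e. free_pages_prepare() would call kernel_init_pages() whenever that
returns true, instead of only when want_init_on_free().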
>>
>> This makes allocations that need to change sensitivity _somewhat_
>> similar to those that need to fallback to a different migratetype. But,
>> the locking requirements mean that this can't just be squashed into the
>> existing "fallback" allocator logic, instead a new allocator path just
>> for this purpose is needed.
>>
>> The new path is assumed to be much cheaper than the really heavyweight
>> stuff like compaction and reclaim. But at present it is treated as less
>
> Uhh, speaking of compaction and reclaim... we rely on finding a whole free
> pageblock in order to flip it. If that doesn't exist, the whole
> get_page_from_freelist() will fail, and we might enter the
> reclaim/compaction cycle in __alloc_pages_slowpath(). But since we might
> ultimately want an order-0 allocation, there won't be any compaction
> attempted, because that code won't know we failed to flip a pageblock. And
> the watermarks might look good and prevent reclaim as well I think? We
> should somehow indicate this, and handle accordingly. Might not be trivial.
> Or maybe reuse pageblock isolation code to do the migrations directly in
> __rmqueue_direct_map?

Ah, thanks, I suspect you are right. I did fear there would be some sort
of case where this "not-quite reclaim" interacted badly with the actual
reclaim, and I tried to test it by running some stuff in parallel with
stress-ng (allocating __GFP_UNMAPPED via secretmem), and I didn't see a
difference in the effective availability of memory. However, I suspect
testing this is quite a deep art and my "run these two commands that I
copy-pasted from an LLM suggestion" test was just crap.

Do you have any workloads you can suggest for evaluating this kinda
thing? We would definitely see it in Google prod (I think we see this
kind of issue with our shrinker-based internal version of ASI distorting
reclaim behaviour in ways even more subtle than this) but that is not a
very practical experimental cycle...

>>
>> +#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
>> +/* Try to allocate a page by mapping/unmapping a block from the direct map. */
>> +static inline struct page *
>> +__rmqueue_direct_map(struct zone *zone, unsigned int request_order,
>> +		     unsigned int alloc_flags, freetype_t freetype)
>> +{
>> +	unsigned int ft_flags_other = freetype_flags(freetype) ^ FREETYPE_UNMAPPED;
>> +	freetype_t ft_other = migrate_to_freetype(free_to_migratetype(freetype),
>> +						  ft_flags_other);
>> +	bool want_mapped = !(freetype_flags(freetype) & FREETYPE_UNMAPPED);
>> +	enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
>
> Why not RMQUEUE_CLAIM? We want to change the migratetype to ours as well,
> not just the unmapped flag?

Oh right, actually I think we need to do RMQUEUE_CLAIM _and_
RMQUEUE_NORMAL (or some variant of RMQUEUE_CLAIM that also supports
allocating from blocks that already have the requested migratetype).

If we just switch it over to RMQUEUE_CLAIM right now, while only one
migratetype supports FREETYPE_UNMAPPED, I think that would actually be
broken: when allocating an unmapped block (want_mapped=true) we would
always hit the freetype_idx<0 case in find_suitable_fallback().

But yeah, we do need to do RMQUEUE_CLAIM too, otherwise we'll miss
opportunities to allocate from other unmapped freetypes once those exist.

>> +	unsigned long irq_flags;
>> +	int nr_pageblocks;
>> +	struct page *page;
>> +	int alloc_order;
>> +	int err;
>> +
>> +	if (freetype_idx(ft_other) < 0)
>> +		return NULL;
>> +
>> +	/*
>> +	 * Might need a TLB shootdown. Even if IRQs are on this isn't
>> +	 * safe if the caller holds a lock (in case the other CPUs need that
>> +	 * lock to handle the shootdown IPI).
>> +	 */
>> +	if (alloc_flags & ALLOC_NOBLOCK)
>> +		return NULL;
>> +
>> +	if (!can_set_direct_map())
>> +		return NULL;
>> +
>> +	lockdep_assert(!irqs_disabled() || unlikely(early_boot_irqs_disabled));
>> +
>> +	/*
>> +	 * Need to [un]map a whole pageblock (otherwise it might require
>> +	 * allocating pagetables). First allocate it.
>> +	 */
>> +	alloc_order = max(request_order, pageblock_order);
>> +	nr_pageblocks = 1 << (alloc_order - pageblock_order);
>> +	zone_lock_irqsave(zone, irq_flags);
>> +	page = __rmqueue(zone, alloc_order, ft_other, alloc_flags, &rmqm);
>> +	zone_unlock_irqrestore(zone, irq_flags);
>> +	if (!page)
>> +		return NULL;
>> +
>> +	/*
>> +	 * Now that IRQs are on it's safe to do a TLB shootdown, and now that we
>> +	 * released the zone lock it's possible to allocate a pagetable if
>> +	 * needed to split up a huge page.
>> +	 *
>> +	 * Note that modifying the direct map may need to allocate pagetables.
>> +	 * What about unbounded recursion? Here are the assumptions that make it
>> +	 * safe:
>> +	 *
>> +	 * - The direct map starts out fully mapped at boot. (This is not really
>> +	 *   an "assumption" as it's in direct control of page_alloc.c).
>> +	 *
>> +	 * - Once pages in the direct map are broken down, they are not
>> +	 *   re-aggregated into larger pages again.
>> +	 *
>> +	 * - Pagetables are never allocated with __GFP_UNMAPPED.
>> +	 *
>> +	 * Under these assumptions, a pagetable might need to be allocated while
>> +	 * _unmapping_ stuff from the direct map during a __GFP_UNMAPPED
>> +	 * allocation. But, the allocation of that pagetable never requires
>> +	 * allocating a further pagetable.
>> +	 */
>> +	err = set_direct_map_valid_noflush(page,
>> +					   nr_pageblocks << pageblock_order, want_mapped);
>> +	if (err == -ENOMEM || WARN_ONCE(err, "err=%d\n", err)) {
>> +		zone_lock_irqsave(zone, irq_flags);
>> +		__free_one_page(page, page_to_pfn(page), zone,
>> +				alloc_order, freetype, FPI_SKIP_REPORT_NOTIFY);
>> +		zone_unlock_irqrestore(zone, irq_flags);
>> +		return NULL;
>> +	}
>> +
>> +	if (!want_mapped) {
>> +		unsigned long start = (unsigned long)page_address(page);
>> +		unsigned long end = start + (nr_pageblocks << (pageblock_order + PAGE_SHIFT));
>> +
>> +		flush_tlb_kernel_range(start, end);
>> +	}
>> +
>> +	for (int i = 0; i < nr_pageblocks; i++) {
>> +		struct page *block_page = page + (pageblock_nr_pages * i);
>> +
>> +		set_pageblock_freetype_flags(block_page, freetype_flags(freetype));
>> +	}
>> +
>> +	if (request_order >= alloc_order)
>> +		return page;
>> +
>> +	/* Free any remaining pages in the block. */
>> +	zone_lock_irqsave(zone, irq_flags);
>> +	for (unsigned int i = request_order; i < alloc_order; i++) {
>> +		struct page *page_to_free = page + (1 << i);
>> +
>> +		__free_one_page(page_to_free, page_to_pfn(page_to_free), zone,
>> +				i, freetype, FPI_SKIP_REPORT_NOTIFY);
>> +	}
>
> Could expand() be used here?

Hm, good point. It should probably look like what try_to_claim_block()
does... Instead of figuring that out right now I'll just say this: if
that works I'll do it, and if I find a reason why it doesn't I will add a
comment explaining it in the next version. (I've appended a very rough
sketch of the shape I'm imagining at the bottom of this mail.)

BTW my thinking is that clarity is the only important factor here; I am
confident that any speedup from this would disappear in the noise of the
TLB flushing etc. But if it works then yeah, I think it would actually be
clearer.

Thanks very much for this review, I really appreciate it!
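P.S. here's roughly the expand()-based shape I'm imagining for that last
hunk. Completely untested, and I'm assuming expand() and
account_freepages() take a freetype_t by this point in the series (they
might not, and the accounting might belong somewhere else):

	/* Put the unused tail of the block back on the freelists. */
	if (request_order < alloc_order) {
		unsigned int nr_added;

		zone_lock_irqsave(zone, irq_flags);
		nr_added = expand(zone, page, request_order, alloc_order,
				  freetype);
		account_freepages(zone, nr_added, freetype);
		zone_unlock_irqrestore(zone, irq_flags);
	}

	return page;

i.e. let expand() scatter the tail buddies onto the freelists instead of
the open-coded per-order __free_one_page() loop, the same way the normal
rmqueue path splits up a high-order page.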