Date: Mon, 1 Jun 2026 10:59:43 +0200
From: "Vlastimil Babka (SUSE)"
To: Brendan Jackman, Borislav Petkov, Dave Hansen, Peter Zijlstra, Andrew Morton, David Hildenbrand, Wei Xu, Johannes Weiner, Zi Yan, Lorenzo Stoakes
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, x86@kernel.org, rppt@kernel.org, Sumit Garg, derkling@google.com, reijiw@google.com, Will Deacon, rientjes@google.com, "Kalyazin, Nikita", patrick.roy@linux.dev, "Itazuri, Takahiro", Andy Lutomirski, David Kaplan, Thomas Gleixner, Yosry Ahmed
Subject: Re: [PATCH v2 19/22] mm/page_alloc: implement __GFP_UNMAPPED allocations
Message-ID: <27da4fd7-195f-4086-992e-287f79eb974b@kernel.org>
References: <20260320-page_alloc-unmapped-v2-0-28bf1bd54f41@google.com> <20260320-page_alloc-unmapped-v2-19-28bf1bd54f41@google.com> <7bfda0d8-2a7a-4337-8b55-d0c158df7839@kernel.org>

On 5/15/26 18:46, Brendan Jackman wrote:
> On Wed May 13, 2026 at 3:43 PM UTC, Vlastimil Babka (SUSE) wrote:
>> On 3/20/26 19:23, Brendan Jackman wrote:
>>> Currently __GFP_UNMAPPED allocs will always fail because, although the
>>> lists exist to hold them, there is no way to actually create an unmapped
>>> page block. This commit adds one, and also the logic to map it back
>>> again when that's needed.
>>>
>>> Doing this at pageblock granularity ensures that the pageblock flags can
>>> be used to infer which freetype a page belongs to. It also provides nice
>>> batching of TLB flushes, and also avoids creating too much unnecessary
>>> TLB fragmentation in the physmap.
>>>
>>> There are some functional requirements for flipping a block:
>>>
>>> - Unmapping requires a TLB shootdown, meaning IRQs must be enabled.
>>>
>>> - Because the main usecase of this feature is to protect against CPU
>>>   exploits, when a block is mapped it needs to be zeroed to ensure no
>>>   residual data is available to attackers. Zeroing a block with a
>>>   spinlock held seems undesirable.
>>
>> Did I overlook something or this patch doesn't do this whole block zeroing?
>> Or is it handled by set_direct_map_valid_noflush itself?
>
> Oops. At some point I was planning to defer the zeroing to another
> series. I changed my mind about that but, apparently I forgot to
> actually add the code back.
>
> The code I deleted was in __rmqueue_direct_map() like this:
>
> 	if (want_mapped) {
>
> 	} else {
> 		unsigned long start = (unsigned long)page_address(page);
> 		unsigned long end = start + (nr_pageblocks << (pageblock_order + PAGE_SHIFT));
>
> 		flush_tlb_kernel_range(start, end);
> 	}
>
> But actually I'm not sure that's what we want: At the moment, there's
> actually a race condition when allocating __GFP_UNMAPPED|__GFP_ZERO:
>
> 1. Take page off freelist
> 2. Mermap it
> 3. Zero it
> 4. Mer-unmap it
>
> I don't know, but some sort of CPU attack might support exploiting the
> gap between 2 and 3 to leak any data left behind from a prior
> allocation. (Like, maybe you can get the data into a uarch buffer during
> the race window, then leak that data afterwards at leisure).

I can't imagine how it would work, but then novel CPU attacks might be
beyond my imagination :) But I think we can ignore hypothetical CPU attacks
for now?
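
For reference, the sizes in the flush snippet quoted above can be worked
through in a small userspace sketch. It assumes 4 KiB pages (a PAGE_SHIFT of
12) and a pageblock_order of 9, both of which are configuration-dependent, and
it takes alloc_order = max(request_order, pageblock_order) and
nr_pageblocks = 1 << (alloc_order - pageblock_order) from the patch quoted
further down. The sketch_* names are made up for illustration and are not
kernel symbols.

```c
#include <assert.h>

/* Assumed values for illustration only; both depend on the kernel config. */
#define SKETCH_PAGE_SHIFT	12	/* 4 KiB base pages */
#define SKETCH_PAGEBLOCK_ORDER	9	/* 512 pages per pageblock */

/* Order actually taken off the freelist: never smaller than one pageblock. */
static unsigned int sketch_alloc_order(unsigned int request_order)
{
	return request_order > SKETCH_PAGEBLOCK_ORDER ?
	       request_order : SKETCH_PAGEBLOCK_ORDER;
}

/* Number of whole pageblocks that get flipped for this request. */
static unsigned long sketch_nr_pageblocks(unsigned int request_order)
{
	unsigned int alloc_order = sketch_alloc_order(request_order);

	assert(alloc_order >= SKETCH_PAGEBLOCK_ORDER);
	return 1UL << (alloc_order - SKETCH_PAGEBLOCK_ORDER);
}

/* Size in bytes of the virtual range between 'start' and 'end'. */
static unsigned long sketch_flush_bytes(unsigned int request_order)
{
	return sketch_nr_pageblocks(request_order)
		<< (SKETCH_PAGEBLOCK_ORDER + SKETCH_PAGE_SHIFT);
}
```

Under those assumptions an order-0 request still flips one whole pageblock, so
the range covers 2 MiB; an order-10 request covers two pageblocks, i.e. 4 MiB.
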
> To mitigate that, we might want to effectively enforce
> want_init_on_free() for unmapped blocks. And, if we do that, we
> don't actually need to zero the block when flipping it back to mapped,
> since there shouldn't be any user data in there.
>
> Any thoughts on that? I have not tried to implement it yet, I might be
> missing something that makes it impractical. Also I haven't read that
> series that's doing zeroing through user addresses either, this might
> have an interesting interaction with that.

I think let's try the simplest way first.

>>> - Updating the pagetables might require allocating a pagetable to break
>>>   down a huge page. This would deadlock if the zone lock was held.
>>>
>>> This makes allocations that need to change sensitivity _somewhat_
>>> similar to those that need to fallback to a different migratetype. But,
>>> the locking requirements mean that this can't just be squashed into the
>>> existing "fallback" allocator logic, instead a new allocator path just
>>> for this purpose is needed.
>>>
>>> The new path is assumed to be much cheaper than the really heavyweight
>>> stuff like compaction and reclaim. But at present it is treated as less
>>
>> Uhh, speaking of compaction and reclaim... we rely on finding a whole free
>> pageblock in order to flip it. If that doesn't exist, the whole
>> get_page_from_freelist() will fail, and we might enter the
>> reclaim/compaction cycle in __alloc_pages_slowpath(). But since we might
>> ultimately want an order-0 allocation, there won't be any compaction
>> attempted, because that code won't know we failed to flip a pageblock. And
>> the watermarks might look good and prevent reclaim as well I think? We
>> should somehow indicate this, and handle accordingly. Might not be trivial.
>> Or maybe reuse pageblock isolation code to do the migrations directly in
>> __rmqueue_direct_map?
>
> Ah, thanks, I suspect you are right.
>
> I did fear there would be some sort of case where this "not-quite
> reclaim" interacted badly with the actual reclaim, and I tried to test
> it by running some stuff in parallel with stress-ng (allocating
> __GFP_UNMAPPED via secretmem), and I didn't see a difference in the
> effective availability of memory. However, I suspect testing this is
> quite a deep art and my "run these two commands that I copy pasted from an
> LLM suggestion" test was just crap.
>
> Do you have any workloads you can suggest for evaluating this kinda
> thing? We would definitely see it in Google prod (I think we see this
> kind of issue with our shrinker-based internal version of ASI distorting
> reclaim behaviour in ways even more subtle than this) but that is not a
> very practical experimental cycle...

Your test seems a good way to start. I realized afterwards that the solution
might be something similar to how we handle ALLOC_NOFRAGMENT.

>>>
>>> +#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
>>> +/* Try to allocate a page by mapping/unmapping a block from the direct map. */
>>> +static inline struct page *
>>> +__rmqueue_direct_map(struct zone *zone, unsigned int request_order,
>>> +		     unsigned int alloc_flags, freetype_t freetype)
>>> +{
>>> +	unsigned int ft_flags_other = freetype_flags(freetype) ^ FREETYPE_UNMAPPED;
>>> +	freetype_t ft_other = migrate_to_freetype(free_to_migratetype(freetype),
>>> +						  ft_flags_other);
>>> +	bool want_mapped = !(freetype_flags(freetype) & FREETYPE_UNMAPPED);
>>> +	enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
>>
>> Why not RMQUEUE_CLAIM? We want to change the migratetype to ours as well,
>> not just the unmapped flag?
>
> Oh right, actually I think we need to do RMQUEUE_CLAIM _and_
> RMQUEUE_NORMAL (or, some variant of RMQUEUE_CLAIM that also supports
> allocating from blocks that already have the requested migratetype).
>
> If we just switch it over to just RMQUEUE_CLAIM right now, while only
> one migratetype supports FREETYPE_UNMAPPED, I think that would actually
> be broken: When allocating an unmapped block, (want_mapped=true) we
> would always hit the freetype_idx<0 case in find_suitable_fallback().

Right.

> But yeah we do need to do RMQUEUE_CLAIM too otherwise we'll miss
> opportunities to allocate from other unmapped freetypes once those
> exist.
>
>>> +	unsigned long irq_flags;
>>> +	int nr_pageblocks;
>>> +	struct page *page;
>>> +	int alloc_order;
>>> +	int err;
>>> +
>>> +	if (freetype_idx(ft_other) < 0)
>>> +		return NULL;
>>> +
>>> +	/*
>>> +	 * Might need a TLB shootdown. Even if IRQs are on this isn't
>>> +	 * safe if the caller holds a lock (in case the other CPUs need that
>>> +	 * lock to handle the shootdown IPI).
>>> +	 */
>>> +	if (alloc_flags & ALLOC_NOBLOCK)
>>> +		return NULL;
>>> +
>>> +	if (!can_set_direct_map())
>>> +		return NULL;
>>> +
>>> +	lockdep_assert(!irqs_disabled() || unlikely(early_boot_irqs_disabled));
>>> +
>>> +	/*
>>> +	 * Need to [un]map a whole pageblock (otherwise it might require
>>> +	 * allocating pagetables). First allocate it.
>>> +	 */
>>> +	alloc_order = max(request_order, pageblock_order);
>>> +	nr_pageblocks = 1 << (alloc_order - pageblock_order);
>>> +	zone_lock_irqsave(zone, irq_flags);
>>> +	page = __rmqueue(zone, alloc_order, ft_other, alloc_flags, &rmqm);
>>> +	zone_unlock_irqrestore(zone, irq_flags);
>>> +	if (!page)
>>> +		return NULL;
>>> +
>>> +	/*
>>> +	 * Now that IRQs are on it's safe to do a TLB shootdown, and now that we
>>> +	 * released the zone lock it's possible to allocate a pagetable if
>>> +	 * needed to split up a huge page.
>>> +	 *
>>> +	 * Note that modifying the direct map may need to allocate pagetables.
>>> +	 * What about unbounded recursion? Here are the assumptions that make it
>>> +	 * safe:
>>> +	 *
>>> +	 * - The direct map starts out fully mapped at boot. (This is not really
>>> +	 *   an "assumption" as it's in direct control of page_alloc.c).
>>> +	 *
>>> +	 * - Once pages in the direct map are broken down, they are not
>>> +	 *   re-aggregated into larger pages again.
>>> +	 *
>>> +	 * - Pagetables are never allocated with __GFP_UNMAPPED.
>>> +	 *
>>> +	 * Under these assumptions, a pagetable might need to be allocated while
>>> +	 * _unmapping_ stuff from the direct map during a __GFP_UNMAPPED
>>> +	 * allocation. But, the allocation of that pagetable never requires
>>> +	 * allocating a further pagetable.
>>> +	 */
>>> +	err = set_direct_map_valid_noflush(page,
>>> +			nr_pageblocks << pageblock_order, want_mapped);
>>> +	if (err == -ENOMEM || WARN_ONCE(err, "err=%d\n", err)) {
>>> +		zone_lock_irqsave(zone, irq_flags);
>>> +		__free_one_page(page, page_to_pfn(page), zone,
>>> +				alloc_order, freetype, FPI_SKIP_REPORT_NOTIFY);
>>> +		zone_unlock_irqrestore(zone, irq_flags);
>>> +		return NULL;
>>> +	}
>>> +
>>> +	if (!want_mapped) {
>>> +		unsigned long start = (unsigned long)page_address(page);
>>> +		unsigned long end = start + (nr_pageblocks << (pageblock_order + PAGE_SHIFT));
>>> +
>>> +		flush_tlb_kernel_range(start, end);
>>> +	}
>>> +
>>> +	for (int i = 0; i < nr_pageblocks; i++) {
>>> +		struct page *block_page = page + (pageblock_nr_pages * i);
>>> +
>>> +		set_pageblock_freetype_flags(block_page, freetype_flags(freetype));
>>> +	}
>>> +
>>> +	if (request_order >= alloc_order)
>>> +		return page;
>>> +
>>> +	/* Free any remaining pages in the block. */
>>> +	zone_lock_irqsave(zone, irq_flags);
>>> +	for (unsigned int i = request_order; i < alloc_order; i++) {
>>> +		struct page *page_to_free = page + (1 << i);
>>> +
>>> +		__free_one_page(page_to_free, page_to_pfn(page_to_free), zone,
>>> +				i, freetype, FPI_SKIP_REPORT_NOTIFY);
>>> +	}
>>
>> Could expand() be used here?
>
> Hm, good point. It should probably look like what try_to_claim_block()
> does...
>
> Instead of figuring that out right now I'll just say this: if that works
> I'll do it, if I find a reason why it doesn't I will add a comment
> explaining it in the next version.

Sounds good.

> BTW my thinking is that clarity is the only important factor here, I am
> confident that any speedup from this would disappear in the noise of the
> TLB flushing etc. But, if it works then yeah I think it would actually
> be clearer.

Sure clarity is important, but also if we have multiple functions doing
similar things instead of sharing code, there's a risk a future change to the
more common code will miss the new one, etc.

> Thanks very much for this review, I really appreciate it!
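
On the expand() question above, a minimal userspace sketch of what the
tail-freeing loop in the quoted patch does may help when comparing the two.
It models only the arithmetic: the block keeps its first 2^request_order
pages, and for each order i in [request_order, alloc_order) one chunk of 2^i
pages at page offset 2^i is handed back. The function name is hypothetical and
nothing here touches real freelists. As far as I can tell this is the same set
of chunks a buddy-style split produces, just visited from the smallest order
up, but that equivalence is an assumption to be checked against expand()
itself.

```c
#include <assert.h>

/*
 * Hypothetical model of the "free any remaining pages" loop.
 * Returns how many pages are handed back, and asserts that the chunks tile
 * the tail of the block with no gaps and no overlap.
 */
static unsigned long sketch_split_tail(unsigned int request_order,
				       unsigned int alloc_order)
{
	unsigned long next = 1UL << request_order;	/* first page not kept */
	unsigned long freed = 0;

	assert(request_order <= alloc_order);

	for (unsigned int i = request_order; i < alloc_order; i++) {
		unsigned long offset = 1UL << i;	/* page + (1 << i) */
		unsigned long size = 1UL << i;		/* freed at order i */

		assert(offset == next);			/* contiguous with previous chunk */
		freed += size;
		next = offset + size;
	}

	assert(next == 1UL << alloc_order);		/* whole block accounted for */
	return freed;
}
```

For an order-0 request served from an order-9 block this hands back 511 pages
as nine chunks of orders 0 through 8; when request_order equals alloc_order
nothing is handed back, which matches the early return in the patch.
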