From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from out-171.mta1.migadu.com (out-171.mta1.migadu.com [95.215.58.171])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 74DB03F211A
	for <linux-kernel@vger.kernel.org>; Mon, 15 Jun 2026 13:08:34 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.171
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1781528917; cv=none; b=JcgjckGtjqNXG6DCxlmAoioRZpCJexxNpvYjSeHquvpiXSgY3VI6ryEoiFZ0QuWJGSLSTJue7PVnMcAvttJIDF48LCetXo249QyFYbty/HV4TJEymoN94RGVfwg/HrhRlK240VZNu4rAUxBpK/AZo2ZTzbN2RrvpzZTYhZlZJjw=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1781528917; c=relaxed/simple;
	bh=0WSiMWO8ByOD17NObGG+wGwtZZXRol9uqkt/BMlDJvI=;
	h=Mime-Version:Content-Type:Date:Message-Id:From:To:Cc:Subject:
	 References:In-Reply-To; b=r/xKTAIMbUMbRZ90MMcjPeH/TaD50bmWHmYrp3fYYd8JGvABD6GTQiJqH6F1U1N4rUYSkpfPPt3cSbLc6SeYew6UDHqsaPGombKLYdm4taQic1P03EAB8Ab23r4fop6Fp7DyoKMECQm/aV22fMrpGv4b1zaD+gTDAZU17YM7sp4=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=h13pkLuv; arc=none smtp.client-ip=95.215.58.171
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="h13pkLuv"
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
Mime-Version: 1.0
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1;
	t=1781528911;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=ZH2JNCY3zsgj304o4Zn97+o+Ib3Vi/W7OCSeMixBVnE=;
	b=h13pkLuvRZ3JEHrvmpC9/2NTC0Lj8d90i7ZFwIgRk7b9+aw/KUB29uPN+CUN8N9XoCBEGz
	HNKytsQS5B+wsVDRzRr1ts4/sig9UR0LadOmWfNMxGUIYFzJHKcnijuVbiQempCiBt6eet
	xPOt0ncDqiaVuviGfYTCiAjDoIkyapQ=
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=UTF-8
Date: Mon, 15 Jun 2026 13:08:13 +0000
Message-Id: <DJ9NAB82KJUF.1G67EXZAJQ3RL@linux.dev>
X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers.
From: "Brendan Jackman" <brendan.jackman@linux.dev>
To: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>, "Brendan Jackman"
 <brendan.jackman@linux.dev>, "Brendan Jackman" <jackmanb@google.com>,
 "Borislav Petkov" <bp@alien8.de>, "Dave Hansen"
 <dave.hansen@linux.intel.com>, "Peter Zijlstra" <peterz@infradead.org>,
 "Andrew Morton" <akpm@linux-foundation.org>, "David Hildenbrand"
 <david@kernel.org>, "Wei Xu" <weixugc@google.com>, "Johannes Weiner"
 <hannes@cmpxchg.org>, "Zi Yan" <ziy@nvidia.com>, "Lorenzo Stoakes"
 <ljs@kernel.org>
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>, <x86@kernel.org>,
 <rppt@kernel.org>, "Sumit Garg" <sumit.garg@oss.qualcomm.com>,
 <derkling@google.com>, <reijiw@google.com>, "Will Deacon"
 <will@kernel.org>, <rientjes@google.com>, "Kalyazin, Nikita"
 <kalyazin@amazon.co.uk>, <patrick.roy@linux.dev>, "Itazuri, Takahiro"
 <itazur@amazon.co.uk>, "Andy Lutomirski" <luto@kernel.org>, "David Kaplan"
 <david.kaplan@amd.com>, "Thomas Gleixner" <tglx@kernel.org>, "Yosry Ahmed"
 <yosry@kernel.org>
Subject: Re: [PATCH v2 19/22] mm/page_alloc: implement __GFP_UNMAPPED
 allocations
References: <20260320-page_alloc-unmapped-v2-0-28bf1bd54f41@google.com>
 <20260320-page_alloc-unmapped-v2-19-28bf1bd54f41@google.com>
 <7bfda0d8-2a7a-4337-8b55-d0c158df7839@kernel.org>
 <DIJEIELZ5DJU.26LYHOT4WR7A2@google.com>
 <DIV92L7AZOHG.1FKDUXLPZEUK4@linux.dev>
 <adaca36e-5896-407b-9f46-601ed5686575@kernel.org>
 <DJ6AVARSAAQX.MK0WG9C2K84P@linux.dev>
 <4a4470b8-0aeb-4618-8a83-888221965153@kernel.org>
In-Reply-To: <4a4470b8-0aeb-4618-8a83-888221965153@kernel.org>
X-Migadu-Flow: FLOW_OUT

On Mon Jun 15, 2026 at 1:02 PM UTC, Vlastimil Babka (SUSE) wrote:
> On 6/11/26 16:46, Brendan Jackman wrote:
>> On Mon Jun 1, 2026 at 8:50 AM UTC, Vlastimil Babka (SUSE) wrote:
>>> On 5/29/26 17:02, Brendan Jackman wrote:
>>>> On Fri May 15, 2026 at 4:46 PM UTC, Brendan Jackman wrote:
>>>>> On Wed May 13, 2026 at 3:43 PM UTC, Vlastimil Babka (SUSE) wrote:
>>>> [...]
>>>>>> Uhh, speaking of compaction and reclaim... we rely on finding a whol=
e free
>>>>>> pageblock in order to flip it. If that doesn't exist, the whole
>>>>>> get_page_from_freelist() will fail, and we might enter the
>>>>>> reclaim/compaction cycle in __alloc_pages_slowpath(). But since we m=
ight
>>>>>> ultimately want an order-0 allocation, there won't be any compaction
>>>>>> attempted, because that code won't know we failed to flip a pagebloc=
k. And
>>>>>> the watermarks might look good and prevent reclaim as well I think? =
We
>>>>>> should somehow indicate this, and handle accordingly. Might not be t=
rivial.
>>>>>> Or maybe reuse pageblock isolation code to do the migrations directl=
y in
>>>>>> __rmqueue_direct_map?
>>>>>
>>>>> Ah, thanks, I suspect you are right.
>>>>>
>>>>> I did fear there would be some sort of case where this "not-quite
>>>>> reclaim" interacted badly with the actual reclaim, and I tried to tes=
t
>>>>> it by running some stuff in parallel with stress-ng (allocating
>>>>> __GFP_UNMAPPED via secretmem), and I didn't see a difference in the
>>>>> effective availability of memory. However, I suspect testing this is
>>>>> quite a deep art, and my "run these two commands that I copy pasted
>>>>> from an LLM suggestion" test was just crap.
>>>>>
>>>>> Do you have any workloads you can suggest for evaluating this kinda
>>>>> thing? We would definitely see it in Google prod (I think we see this
>>>>> kind of issue with our shrinker-based internal version of ASI distort=
ing
>>>>> reclaim behaviour in ways even more subtle than this) but that is not=
 a
>>>>> very practical experimental cycle...
>>>>=20
>>>> I slop-coded a benchmark:
>>>>=20
>>>> https://github.com/bjackman/kernel-benchmarks-nix/tree/master/packages=
/benchmarks/secretmem-vs-frag
>>>>=20
>>>> It does some mmap/munmap patterns to try and generate fragmentation,
>>>> then spams secretmem allocations until it gets OOM-killed.
>>>>=20
>>>> With this series, I see the OOM-kills happening noticeably sooner on a
>>>> 1GiB VM:
>>>>=20
>>>> metric: secretmem_allocated_bytes (B)   |  test: secretmem-vs-frag
>>>> +---------------------------------------------+---------+-------------=
+-------------+-----------------+-------------+-------+
>>>> | kernel_release                              | samples |        mean =
|         min | histogram       |         max | =CE=94=CE=BC    |
>>>> +---------------------------------------------+---------+-------------=
+-------------+-----------------+-------------+-------+
>>>> | 7.0.0-rc4-next-20260319                     |       4 | 683,147,264 =
| 643,825,664 |               =E2=96=88 | 715,128,832 |       |
>>>> | 7.0.0-rc4-next-20260319-00028-gf00246eb72cd |       3 | 623,553,195 =
| 551,550,976 |            =E2=96=88=E2=96=88=E2=96=88  | 692,060,160 | -8.=
7% |
>>>> +---------------------------------------------+---------+-------------=
+-------------+-----------------+-------------+-------+
>>>>=20
>>>> So... I think maybe I've reproduced the issue you pointed out? I will
>>>> try and fix it and see if this degradation goes away.
>>>
>>> Since I assume the fragmenting allocations are movable allocations, it
>>> might be the case, yeah.
>>=20
>> Alright, so I tried splitting NR_FREE_PAGES_BLOCKS into two counters to
>> track mapped vs unmapped blocks. Then I gave
>> compaction_suit_allocation_order() an 'unmapped' flag:
>>=20
>>=20
>> @@ -2510,19 +2510,39 @@ bool compaction_zonelist_suitable(struct alloc_c=
ontext *ac, int order,
>>  static enum compact_result
>>  compaction_suit_allocation_order(struct zone *zone, unsigned int order,
>>                                  int highest_zoneidx, unsigned int alloc=
_flags,
>> -                                bool async, bool kcompactd)
>> +                                bool unmapped, bool async, bool kcompac=
td)
>>  {
>>         unsigned long free_pages;
>>         unsigned long watermark;
>>=20
>> -       if (kcompactd && defrag_mode)
>> +       /*
>> +        * Might need to generate a whole free block regardless of the a=
ctual
>> +        * allocation order:
>> +        *
>> +        * - When allocating an unmapped page, because the allocator onl=
y unmaps
>> +        *   whole blocks at a time.
>> +        *
>> +        *   Why doesn't this apply to the other way around too? (Mightn=
't we
>> +        *   need to _map_ a whole block?) This is a temporary simplific=
ation:
>> +        *   currently, unmapped blocks don't contain movable pages, so
>> +        *   compaction isn't going to free up one of those.
>> +        *
>> +        * - In defrag_mode, because the allocator is unwilling to "stea=
l" pages
>> +        *   from the "wrong" block.
>> +        *
>> +        *   Why is this only under kcompactd?
>> +        *
>> +        * Temporary simplification: unmapped pageblocks are currently
>> +        * nonmovable. So if the compactor is trying to service a
>> +        */
>> +       if (unmapped)
>> +               free_pages =3D zone_page_state(zone, NR_FREE_PAGES_BLOCK=
S_MAPPED);
>> +       else if (kcompactd && defrag_mode)
>>                 free_pages =3D zone_free_pages_blocks(zone);
>>         else
>>                 free_pages =3D zone_page_state(zone, NR_FREE_PAGES);
>>=20
>>=20
>> ... Then, I changed __alloc_pages_direct_compact() to try to compact
>> for a whole block whenever we are trying to allocate an unmapped
>> page (note I think there's an orthogonal bug here where it leaks memory
>> when there's a "captured" compaction):
>>=20
>>=20
>> index 4f04e897c5374..7eed22f3b26eb 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -824,6 +824,9 @@ compaction_capture(struct capture_control *capc, str=
uct page *page,
>>             capc_mt !=3D MIGRATE_MOVABLE)
>>                 return false;
>>=20
>> +       if (freetype_flags(freetype) !=3D freetype_flags(capc->cc->freet=
ype))
>> +               return false;
>> +
>>         if (migratetype !=3D capc_mt)
>>                 trace_mm_page_alloc_extfrag(page, capc->cc->order, order=
,
>>                                             capc_mt, migratetype);
>> @@ -4469,20 +4472,27 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, uns=
igned int order,
>>         struct page *page =3D NULL;
>>         unsigned long pflags;
>>         unsigned int noreclaim_flag;
>> +       unsigned int compact_order =3D order;
>>=20
>> -       if (!order)
>> +       // TODO: Is it OK to always run compaction like this?
>> +       /*
>> +        * Unmapped allocations benefit from compaction even at order 0,=
 because the
>> +        * allocator will actually grab a whole block.
>> +        */
>> +       if (freetype_flags(ac->freetype) & FREETYPE_UNMAPPED)
>> +               compact_order =3D pageblock_order;
>> +
>> +       if (!compact_order)
>>                 return NULL;
>>=20
>>         psi_memstall_enter(&pflags);
>>         delayacct_compact_start();
>>         noreclaim_flag =3D memalloc_noreclaim_save();
>>=20
>> -       *compact_result =3D try_to_compact_pages(gfp_mask, order, alloc_=
flags, ac,
>> -                                                               prio, &p=
age);
>> +       // TODO: deal with captured page, if we changed the order it wil=
l have the
>> +       // wrong order. Also check it respects the freetype flags.
>> +       *compact_result =3D try_to_compact_pages(gfp_mask, compact_order=
,
>> +                                              alloc_flags, ac, prio, &p=
age);
>>=20
>>         memalloc_noreclaim_restore(noreclaim_flag);
>>         psi_memstall_leave(&pflags);
>>=20
>> Full code:
>> https://github.com/bjackman/linux/tree/page_alloc-unmapped-2026-06-11
>>=20
>> This makes the regression above (faster OOMs) go away, but it seems like
>> a pretty blunt approach. But then I'm realising I don't really know why =
it
>> matters?
>
> You mean, why does it matter that we don't OOM prematurely? I'd say that
> matters a lot.
>
>> The main thing is presumably that we are more likely to
>> pointlessly attempt compaction or compact more than we need. But in that
>
> I don't understand why that would be the case? If compaction thinks our g=
oal
> is order-0, there won't be any?
>
> Or you mean that it doesn't matter that your approach above is blunt, and
> are talking about the consequences of that blunt approach?

Yeah, exactly this - the blunt approach seems to fix the issue, but I
dunno if it's cheating. It _feels_ like cheating, but I actually don't
know what it would break - maybe nothing?
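
To make the question a bit more concrete, here is a toy userspace model
of the order selection in the __alloc_pages_direct_compact() hack quoted
above. It is only a sketch and not kernel code: toy_compact_order(),
toy_compaction_attempted(), TOY_NR_ORDERS and TOY_PAGEBLOCK_ORDER are
names made up for illustration, and the block order of 9 is an assumption
(2MiB blocks with 4KiB pages).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed for illustration: pageblock order 9, i.e. 2MiB with 4KiB pages. */
#define TOY_PAGEBLOCK_ORDER 9u
/* Assumed upper bound on allocation orders, only used as a sanity check. */
#define TOY_NR_ORDERS 11u

/*
 * Order that direct compaction is asked to produce. Unmapped allocations
 * always target a whole block, since the allocator only unmaps whole
 * blocks at a time.
 */
static inline unsigned int toy_compact_order(unsigned int order, bool unmapped)
{
	assert(order < TOY_NR_ORDERS);
	return unmapped ? TOY_PAGEBLOCK_ORDER : order;
}

/* Direct compaction is skipped entirely when the target order is 0. */
static inline bool toy_compaction_attempted(unsigned int order, bool unmapped)
{
	return toy_compact_order(order, unmapped) > 0;
}
```

In this model toy_compaction_attempted(0, false) is false, which is the
existing order-0 behaviour, while toy_compaction_attempted(0, true) is
true because the target becomes a whole block. So whenever direct
compaction is reached for an unmapped allocation, it works towards a
full block regardless of the order that was actually requested, and
that is the part that feels blunt.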