[PATCH] mm/page_alloc: skip high atomic reservation at or below costly order

* [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order
@ 2026-05-19  1:25 JP Kobryn (Meta)
  2026-05-19 19:27 ` Andrew Morton
  2026-05-19 20:28 ` Johannes Weiner
  0 siblings, 2 replies; 4+ messages in thread
From: JP Kobryn (Meta) @ 2026-05-19  1:25 UTC (permalink / raw)
  To: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy, linux-mm
  Cc: usama.arif, kirill, willy, linux-kernel, kernel-team

We're seeing a pattern in production where 2MB THP order-9 allocations are
failing due to fragmentation and triggering reclaim on systems with plenty
of free memory. Over time, the success rate of these THP allocations does
not increase at all.

Inspecting zone->vm_stat[NR_FREE_PAGES] via kprobe on compaction_suitable()
indicated the given zone had sufficient free pages for order-9 allocations,
yet they were going unused. Drilling down into the zone and inspecting
/proc/pagetypeinfo revealed why. Order-9 blocks were accumulating in the
zone's HighAtomic bucket (while zero were present in Movable). THP is
unable to draw blocks from HighAtomic since that bucket is not in the
fallback list.
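
The /proc/pagetypeinfo inspection described above can be approximated with
a small parser. This is a minimal userspace sketch, not the tooling used
for the analysis. It assumes the "Free pages count per migrate type" rows
have the form "Node N, zone NAME, type TYPE" followed by one count per
order, and that there are 11 orders (0-10); both depend on kernel version
and configuration. The names parse_free_row() and NR_ORDERS are
hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NR_ORDERS 11	/* assumption: free lists for orders 0..10 */

/*
 * Parse one free-count row of /proc/pagetypeinfo (assumed layout:
 * "Node N, zone NAME, type TYPE c0 c1 ... c10").
 * Returns 0 and fills counts[] when the row is for want_type,
 * -1 when the row is for another type or is malformed.
 */
static int parse_free_row(const char *line, const char *want_type,
			  unsigned long counts[NR_ORDERS])
{
	int node, off = 0;
	char zone[32], type[32];
	const char *p;

	assert(line && want_type && counts);

	/* %n records how many characters the header consumed */
	if (sscanf(line, "Node %d, zone %31[^,], type %31s %n",
		   &node, zone, type, &off) < 3 || off == 0)
		return -1;
	if (strcmp(type, want_type) != 0)
		return -1;

	p = line + off;
	for (int i = 0; i < NR_ORDERS; i++) {
		char *end;

		counts[i] = strtoul(p, &end, 10);
		if (end == p)	/* fewer columns than expected */
			return -1;
		p = end;
	}
	return 0;
}
```

Feeding each line of /proc/pagetypeinfo to parse_free_row() with
"HighAtomic" and then with "Movable", and comparing counts[9], gives the
order-9 comparison described above.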

The heuristic for reserving pageblocks in HighAtomic is that any atomic
allocation greater than order-0 will result in the full pageblock being
captured. This means that an order-1 atomic allocation (2 pages) will
over-reserve by 256x, capturing a full 512-page pageblock.
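
The over-reservation arithmetic can be checked with a short userspace
sketch. It assumes an order-9 pageblock (512 base pages, as in the 2MB THP
case above); PAGEBLOCK_ORDER and both helper names are hypothetical and
exist only for this illustration.

```c
#include <assert.h>

#define PAGEBLOCK_ORDER 9	/* assumption: 2MB pageblocks, 4K base pages */

/* Number of base pages in a block of the given order. */
static unsigned long pages_in_order(unsigned int order)
{
	return 1UL << order;
}

/*
 * How many times larger a full pageblock is than the allocation
 * that caused it to be reserved.
 */
static unsigned long over_reserve_factor(unsigned int order)
{
	assert(order <= PAGEBLOCK_ORDER);
	return pages_in_order(PAGEBLOCK_ORDER) / pages_in_order(order);
}
```

Under that assumption, over_reserve_factor(1) is 512 / 2 = 256, and the
factor for order-3, the largest order the patch gates, is 512 / 8 = 64.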

Gate the reservation on order. Skip for allocations at or below
PAGE_ALLOC_COSTLY_ORDER. This prevents smaller atomic allocations from
reserving entire pageblocks, and significantly helps when THP is in use on
a fragmented but otherwise healthy system.

Testing was performed using an A/B Instagram workload receiving prod
traffic. Each side had ~60 hosts with 64G memory. The patch resulted in
several gains:

Unpatched
HighAtomic pageblocks per host: 309-312 (1% of zone or 620MB),
  ...all order-9 blocks in HighAtomic
THP success rate: 1-6%
Compaction success rate: 0-2%
pgscan_kswapd (total across ~60 hosts, per minute): ~70.2M
Atomic order-4+ allocations: 0

Patched
HighAtomic pageblocks per host: 1
THP success rate: 44-78%
Compaction success rate: 24-47%
pgscan_kswapd (total across ~60 hosts, per minute): ~29.9M
Atomic order-4+ allocations: 0

Note that for this workload all atomic allocations were order 0-3
originating from the network stack, btrfs, and scheduler.
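
The unpatched HighAtomic figure can be related to the reserve cap of
roughly 1% of a zone. The following is a userspace sketch only: it assumes
4K base pages and 512-page pageblocks, and rounding the cap up to a whole
pageblock is an assumption about the kernel's behavior rather than
something taken from this patch. All names are hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define PAGEBLOCK_NR_PAGES 512UL	/* assumption: order-9 pageblocks */
#define PAGE_SIZE_KIB 4UL		/* assumption: 4K base pages */

/*
 * Upper bound on HighAtomic pageblocks for a zone with the given
 * number of managed pages: roughly 1% of the zone, or nothing if
 * 1% is smaller than one pageblock.
 */
static unsigned long highatomic_cap_blocks(unsigned long managed_pages)
{
	unsigned long one_percent = managed_pages / 100;

	if (one_percent < PAGEBLOCK_NR_PAGES)
		return 0;
	/* round up to whole pageblocks (assumption) */
	return (one_percent + PAGEBLOCK_NR_PAGES - 1) / PAGEBLOCK_NR_PAGES;
}

/* Convert a pageblock count to MiB. */
static unsigned long blocks_to_mib(unsigned long blocks)
{
	assert(blocks <= ULONG_MAX / (PAGEBLOCK_NR_PAGES * PAGE_SIZE_KIB));
	return blocks * PAGEBLOCK_NR_PAGES * PAGE_SIZE_KIB / 1024;
}
```

With these assumptions, blocks_to_mib(310) is 620, which matches the
"309-312 (1% of zone or 620MB)" line in the unpatched results.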

Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
---
 mm/page_alloc.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e262d1316259d..45d8f6844f510 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3446,6 +3446,13 @@ static void reserve_highatomic_pageblock(struct page *page, int order,
 	int mt;
 	unsigned long max_managed;

+	/*
+	 * Don't reserve a pageblock for lower orders.
+	 * Order 1-3 allocs should not capture a huge page size block.
+	 */
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		return;
+
 	/*
 	 * The number reserved as: minimum is 1 pageblock, maximum is
 	 * roughly 1% of a zone. But if 1% of a zone falls below a
-- 
2.53.0-Meta


* Re: [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order
  2026-05-19  1:25 [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order JP Kobryn (Meta)
@ 2026-05-19 19:27 ` Andrew Morton
  2026-05-19 23:25   ` JP Kobryn (Meta)
  2026-05-19 20:28 ` Johannes Weiner
  1 sibling, 1 reply; 4+ messages in thread
From: Andrew Morton @ 2026-05-19 19:27 UTC (permalink / raw)
  To: JP Kobryn (Meta)
  Cc: vbabka, surenb, mhocko, jackmanb, hannes, ziy, linux-mm,
	usama.arif, kirill, willy, linux-kernel, kernel-team

On Mon, 18 May 2026 18:25:32 -0700 "JP Kobryn (Meta)" <jp.kobryn@linux.dev> wrote:

> We're seeing a pattern in production where 2MB THP order-9 allocations are
> failing due to fragmentation and triggering reclaim on systems with plenty
> of free memory. Over time, the success rate of these THP allocations do not
> increase at all.
> 
> Inspecting zone->vm_stat[NR_FREE_PAGES] via kprobe on compaction_suitable()
> indicated the given zone had sufficient free pages for order-9 allocations,
> yet they were going unused. Drilling down into the zone and inspecting
> /proc/pagetypeinfo revealed why. Order-9 blocks were accumulating in the
> zone's HighAtomic bucket (while zero were present in Movable). THP is
> unable to draw blocks from HighAtomic since that bucket is not in the
> fallback list.
> 
> The heuristic for reserving pageblocks in HighAtomic is that any atomic
> allocation greater than order-0 will result in the full pageblock being
> captured. This means that an order-1 atomic allocation will over-reserve by
> 256x, a full 512 pageblock.
> 
> Gate the reservation on order. Skip for allocations at or below
> PAGE_ALLOC_COSTLY_ORDER. This prevents smaller atomic allocations from
> reserving entire pageblocks, and significantly helps when THP is in use on
> a fragmented but otherwise healthy system.
> 
> Testing was performed using an A/B instagram workload receiving prod
> traffic. Each side had ~60 hosts with 64G memory. The patch resulted in
> several gains:
> 
> Unpatched
> HighAtomic pageblocks per host: 309-312 (1% of zone or 620MB),
>   ...all order-9 blocks in HighAtomic
> THP success rate: 1-6%
> Compaction success rate: 0-2%
> pgscan_kswapd (total across ~60 hosts, per minute): ~70.2M
> Atomic order-4+ allocations: 0
> 
> Patched
> HighAtomic pageblocks per host: 1
> THP success rate: 44-78%
> Compaction success rate: 24-47%
> pgscan_kswapd (total across ~60 hosts, per minute): ~29.9M
> Atomic order-4+ allocations: 0
> 
> Note that for this workload all atomic allocations were order 0-3
> originating from the network stack, btrfs, and scheduler.
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3446,6 +3446,13 @@ static void reserve_highatomic_pageblock(struct page *page, int order,
>  	int mt;
>  	unsigned long max_managed;
>  
> +	/*
> +	 * Don't reserve a pageblock for lower orders.
> +	 * Order 1-3 allocs should not capture a huge page size block.
> +	 */
> +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> +		return;
> +
>  	/*
>  	 * The number reserved as: minimum is 1 pageblock, maximum is
>  	 * roughly 1% of a zone. But if 1% of a zone falls below a

Sashiko asked

: Does skipping the HighAtomic reservation for orders 1-3 break the
: anti-fragmentation guarantees for these atomic allocations?
: 
: The MIGRATE_HIGHATOMIC reserve protects high-order atomic allocations
: from failing under fragmentation by taking ownership of the entire
: pageblock.
: 
: If order-1 through order-3 atomic allocations fall back to stealing
: pages, but the pageblock remains in its original migratetype, won't
: order-0 non-atomic allocations consume the remaining contiguous space?
: 
: Under memory pressure, this could leave no contiguous blocks for atomic
: allocations to steal.  Because these atomic allocations cannot trigger
: direct reclaim or compaction, they might fail, potentially leading to
: dropped packets or I/O errors in subsystems like the network stack or
: BTRFS.
: 
: Could background compaction or khugepaged be used to unreserve
: HighAtomic blocks dynamically instead of disabling the reserve for
: these orders?




* Re: [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order
  2026-05-19  1:25 [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order JP Kobryn (Meta)
  2026-05-19 19:27 ` Andrew Morton
@ 2026-05-19 20:28 ` Johannes Weiner
  1 sibling, 0 replies; 4+ messages in thread
From: Johannes Weiner @ 2026-05-19 20:28 UTC (permalink / raw)
  To: JP Kobryn (Meta)
  Cc: akpm, vbabka, surenb, mhocko, jackmanb, ziy, linux-mm, usama.arif,
	kirill, willy, linux-kernel, kernel-team

On Mon, May 18, 2026 at 06:25:32PM -0700, JP Kobryn (Meta) wrote:
> We're seeing a pattern in production where 2MB THP order-9 allocations are
> failing due to fragmentation and triggering reclaim on systems with plenty
> of free memory. Over time, the success rate of these THP allocations do not
> increase at all.
> 
> Inspecting zone->vm_stat[NR_FREE_PAGES] via kprobe on compaction_suitable()
> indicated the given zone had sufficient free pages for order-9 allocations,
> yet they were going unused. Drilling down into the zone and inspecting
> /proc/pagetypeinfo revealed why. Order-9 blocks were accumulating in the
> zone's HighAtomic bucket (while zero were present in Movable). THP is
> unable to draw blocks from HighAtomic since that bucket is not in the
> fallback list.
> 
> The heuristic for reserving pageblocks in HighAtomic is that any atomic
> allocation greater than order-0 will result in the full pageblock being
> captured. This means that an order-1 atomic allocation will over-reserve by
> 256x, a full 512 pageblock.
> 
> Gate the reservation on order. Skip for allocations at or below
> PAGE_ALLOC_COSTLY_ORDER. This prevents smaller atomic allocations from
> reserving entire pageblocks, and significantly helps when THP is in use on
> a fragmented but otherwise healthy system.
> 
> Testing was performed using an A/B instagram workload receiving prod
> traffic. Each side had ~60 hosts with 64G memory. The patch resulted in
> several gains:
> 
> Unpatched
> HighAtomic pageblocks per host: 309-312 (1% of zone or 620MB),
>   ...all order-9 blocks in HighAtomic
> THP success rate: 1-6%
> Compaction success rate: 0-2%
> pgscan_kswapd (total across ~60 hosts, per minute): ~70.2M
> Atomic order-4+ allocations: 0
> 
> Patched
> HighAtomic pageblocks per host: 1
> THP success rate: 44-78%
> Compaction success rate: 24-47%
> pgscan_kswapd (total across ~60 hosts, per minute): ~29.9M
> Atomic order-4+ allocations: 0

This is an interesting patch. A couple of thoughts:

1. You disabled the highatomic reserve for this workload and it didn't
seem to matter. Presumably <costly orders don't need the protection.

2. Maxing out the reserves is odd. ALLOC_HIGHATOMIC allocations will
try reserved space first, and I'd expect things that are commonly
highatomic to be short-lived. Why don't we stop with a couple of
claimed highatomic blocks that get continuously recycled?

3. The impact on THP and compaction success rate is pretty
extreme. How can 1% of memory throw such a wrench into the gears?

Have you tried this with other workloads?



* Re: [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order
  2026-05-19 19:27 ` Andrew Morton
@ 2026-05-19 23:25   ` JP Kobryn (Meta)
  0 siblings, 0 replies; 4+ messages in thread
From: JP Kobryn (Meta) @ 2026-05-19 23:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: vbabka, surenb, mhocko, jackmanb, hannes, ziy, linux-mm,
	usama.arif, kirill, willy, linux-kernel, kernel-team

On 5/19/26 12:27 PM, Andrew Morton wrote:
> On Mon, 18 May 2026 18:25:32 -0700 "JP Kobryn (Meta)" <jp.kobryn@linux.dev> wrote:
> 
>> We're seeing a pattern in production where 2MB THP order-9 allocations are
>> failing due to fragmentation and triggering reclaim on systems with plenty
>> of free memory. Over time, the success rate of these THP allocations do not
>> increase at all.
>>
>> Inspecting zone->vm_stat[NR_FREE_PAGES] via kprobe on compaction_suitable()
>> indicated the given zone had sufficient free pages for order-9 allocations,
>> yet they were going unused. Drilling down into the zone and inspecting
>> /proc/pagetypeinfo revealed why. Order-9 blocks were accumulating in the
>> zone's HighAtomic bucket (while zero were present in Movable). THP is
>> unable to draw blocks from HighAtomic since that bucket is not in the
>> fallback list.
>>
>> The heuristic for reserving pageblocks in HighAtomic is that any atomic
>> allocation greater than order-0 will result in the full pageblock being
>> captured. This means that an order-1 atomic allocation will over-reserve by
>> 256x, a full 512 pageblock.
>>
>> Gate the reservation on order. Skip for allocations at or below
>> PAGE_ALLOC_COSTLY_ORDER. This prevents smaller atomic allocations from
>> reserving entire pageblocks, and significantly helps when THP is in use on
>> a fragmented but otherwise healthy system.
>>
>> Testing was performed using an A/B instagram workload receiving prod
>> traffic. Each side had ~60 hosts with 64G memory. The patch resulted in
>> several gains:
>>
>> Unpatched
>> HighAtomic pageblocks per host: 309-312 (1% of zone or 620MB),
>>    ...all order-9 blocks in HighAtomic
>> THP success rate: 1-6%
>> Compaction success rate: 0-2%
>> pgscan_kswapd (total across ~60 hosts, per minute): ~70.2M
>> Atomic order-4+ allocations: 0
>>
>> Patched
>> HighAtomic pageblocks per host: 1
>> THP success rate: 44-78%
>> Compaction success rate: 24-47%
>> pgscan_kswapd (total across ~60 hosts, per minute): ~29.9M
>> Atomic order-4+ allocations: 0
>>
>> Note that for this workload all atomic allocations were order 0-3
>> originating from the network stack, btrfs, and scheduler.
>>
>> ...
>>
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3446,6 +3446,13 @@ static void reserve_highatomic_pageblock(struct page *page, int order,
>>   	int mt;
>>   	unsigned long max_managed;
>>   
>> +	/*
>> +	 * Don't reserve a pageblock for lower orders.
>> +	 * Order 1-3 allocs should not capture a huge page size block.
>> +	 */
>> +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
>> +		return;
>> +
>>   	/*
>>   	 * The number reserved as: minimum is 1 pageblock, maximum is
>>   	 * roughly 1% of a zone. But if 1% of a zone falls below a
> 
> Sashiko asked
> 
> : Does skipping the HighAtomic reservation for orders 1-3 break the
> : anti-fragmentation guarantees for these atomic allocations?

The data included in the changelog supports the claim that the reserve
does not provide a benefit at these orders. Even on fragmented systems,
orders 1-3 have plenty of pages available.

> :
> : The MIGRATE_HIGHATOMIC reserve protects high-order atomic allocations
> : from failing under fragmentation by taking ownership of the entire
> : pageblock.

In the experiments, there were no failures for orders 1-3 despite a
fragmented system and no reserved pageblocks for these orders.

> :
> : If order-1 through order-3 atomic allocations fall back to stealing
> : pages, but the pageblock remains in its original migratetype, won't
> : order-0 non-atomic allocations consume the remaining contiguous space?

With the patch, these pageblocks stay movable. So if fallback is needed,
movable pages can still be taken. The patch also improves compaction, so
contiguous space increases overall.

> :
> : Under memory pressure, this could leave no contiguous blocks for atomic
> : allocations to steal.  Because these atomic allocations cannot trigger
> : direct reclaim or compaction, they might fail, potentially leading to
> : dropped packets or I/O errors in subsystems like the network stack or
> : BTRFS.

Reserved HighAtomic pageblocks are not currently treated as a precious
resource. Under real memory pressure, the kernel already gives them up -
the unreserve mechanism kicks in and converts the HighAtomic pageblocks
back to their original migrate type.

> :
> : Could background compaction or khugepaged be used to unreserve
> : HighAtomic blocks dynamically instead of disabling the reserve for
> : these orders?

This would call for extra scanning, overhead, and stats. The patch
reduces reclaim, among other things, so adding a new scanner feels like
going in the opposite direction.


