[PATCH] mm/page_alloc: skip high atomic reservation at or below costly order

Linux-mm Archive on lore.kernel.org

* [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order
@ 2026-05-19  1:25 JP Kobryn (Meta)
  2026-05-19 19:27 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: JP Kobryn (Meta) @ 2026-05-19  1:25 UTC (permalink / raw)
  To: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy, linux-mm
  Cc: usama.arif, kirill, willy, linux-kernel, kernel-team

We're seeing a pattern in production where 2MB THP order-9 allocations are
failing due to fragmentation and triggering reclaim on systems with plenty
of free memory. Over time, the success rate of these THP allocations does not
increase at all.

Inspecting zone->vm_stat[NR_FREE_PAGES] via kprobe on compaction_suitable()
indicated the given zone had sufficient free pages for order-9 allocations,
yet they were going unused. Drilling down into the zone and inspecting
/proc/pagetypeinfo revealed why. Order-9 blocks were accumulating in the
zone's HighAtomic bucket (while zero were present in Movable). THP is
unable to draw blocks from HighAtomic since that bucket is not in the
fallback list.

The heuristic for reserving pageblocks in HighAtomic is that any atomic
allocation greater than order-0 will result in the full pageblock being
captured. This means that an order-1 atomic allocation will over-reserve by
256x, a full 512-page pageblock.
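
The 256x figure can be checked with a little arithmetic. The sketch below is illustrative only and assumes a pageblock order of 9 (512 base pages, as on x86-64 with 4 KiB pages); the helper names are hypothetical and do not exist in the kernel.

```c
/* Assumed for illustration: pageblock_order is 9 (512 base pages). */
#define EXAMPLE_PAGEBLOCK_ORDER 9

/* Number of base pages covered by an allocation of the given order. */
static unsigned long pages_for_order(unsigned int order)
{
	return 1UL << order;
}

/* How many times larger the reserved pageblock is than the request. */
static unsigned long over_reserve_factor(unsigned int order)
{
	return pages_for_order(EXAMPLE_PAGEBLOCK_ORDER) / pages_for_order(order);
}
```

Under these assumptions an order-1 request (2 pages) captures 256 times what it asked for, and even an order-3 request (8 pages) captures 64 times its size.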

Gate the reservation on order. Skip for allocations at or below
PAGE_ALLOC_COSTLY_ORDER. This prevents smaller atomic allocations from
reserving entire pageblocks, and significantly helps when THP is in use on
a fragmented but otherwise healthy system.

Testing was performed using an A/B Instagram workload receiving prod
traffic. Each side had ~60 hosts with 64G memory. The patch resulted in
several gains:

Unpatched
HighAtomic pageblocks per host: 309-312 (1% of zone or 620MB),
  ...all order-9 blocks in HighAtomic
THP success rate: 1-6%
Compaction success rate: 0-2%
pgscan_kswapd (total across ~60 hosts, per minute): ~70.2M
Atomic order-4+ allocations: 0

Patched
HighAtomic pageblocks per host: 1
THP success rate: 44-78%
Compaction success rate: 24-47%
pgscan_kswapd (total across ~60 hosts, per minute): ~29.9M
Atomic order-4+ allocations: 0

Note that for this workload all atomic allocations were order 0-3
originating from the network stack, btrfs, and scheduler.
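
The 620MB figure in the unpatched numbers follows from the pageblock count. A minimal sketch of the conversion, assuming 4 KiB base pages (page shift 12) and a pageblock order of 9; the function name is hypothetical:

```c
/*
 * Convert a pageblock count to MiB.
 * pageblock_order and page_shift are passed in because both depend on
 * the architecture and kernel configuration.
 */
static unsigned long pageblocks_to_mib(unsigned long nr_pageblocks,
				       unsigned int pageblock_order,
				       unsigned int page_shift)
{
	unsigned long bytes_per_block = 1UL << (pageblock_order + page_shift);

	return (nr_pageblocks * bytes_per_block) >> 20;
}
```

With those assumed values, 310 pageblocks come to 620 MiB, which sits inside the reported 309-312 range.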

Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
---
 mm/page_alloc.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e262d1316259d..45d8f6844f510 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3446,6 +3446,13 @@ static void reserve_highatomic_pageblock(struct page *page, int order,
 	int mt;
 	unsigned long max_managed;

+	/*
+	 * Don't reserve a pageblock for lower orders.
+	 * Order 1-3 allocs should not capture a huge page size block.
+	 */
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		return;
+
 	/*
 	 * The number reserved as: minimum is 1 pageblock, maximum is
 	 * roughly 1% of a zone. But if 1% of a zone falls below a
-- 
2.53.0-Meta


* Re: [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order
  2026-05-19  1:25 [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order JP Kobryn (Meta)
@ 2026-05-19 19:27 ` Andrew Morton
  2026-05-19 23:25   ` JP Kobryn (Meta)
  2026-05-19 20:28 ` Johannes Weiner
  2026-05-28 17:09 ` Frank van der Linden
  2 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2026-05-19 19:27 UTC (permalink / raw)
  To: JP Kobryn (Meta)
  Cc: vbabka, surenb, mhocko, jackmanb, hannes, ziy, linux-mm,
	usama.arif, kirill, willy, linux-kernel, kernel-team

On Mon, 18 May 2026 18:25:32 -0700 "JP Kobryn (Meta)" <jp.kobryn@linux.dev> wrote:

> We're seeing a pattern in production where 2MB THP order-9 allocations are
> failing due to fragmentation and triggering reclaim on systems with plenty
> of free memory. Over time, the success rate of these THP allocations do not
> increase at all.
> 
> Inspecting zone->vm_stat[NR_FREE_PAGES] via kprobe on compaction_suitable()
> indicated the given zone had sufficient free pages for order-9 allocations,
> yet they were going unused. Drilling down into the zone and inspecting
> /proc/pagetypeinfo revealed why. Order-9 blocks were accumulating in the
> zone's HighAtomic bucket (while zero were present in Movable). THP is
> unable to draw blocks from HighAtomic since that bucket is not in the
> fallback list.
> 
> The heuristic for reserving pageblocks in HighAtomic is that any atomic
> allocation greater than order-0 will result in the full pageblock being
> captured. This means that an order-1 atomic allocation will over-reserve by
> 256x, a full 512 pageblock.
> 
> Gate the reservation on order. Skip for allocations at or below
> PAGE_ALLOC_COSTLY_ORDER. This prevents smaller atomic allocations from
> reserving entire pageblocks, and significantly helps when THP is in use on
> a fragmented but otherwise healthy system.
> 
> Testing was performed using an A/B instagram workload receiving prod
> traffic. Each side had ~60 hosts with 64G memory. The patch resulted in
> several gains:
> 
> Unpatched
> HighAtomic pageblocks per host: 309-312 (1% of zone or 620MB),
>   ...all order-9 blocks in HighAtomic
> THP success rate: 1-6%
> Compaction success rate: 0-2%
> pgscan_kswapd (total across ~60 hosts, per minute): ~70.2M
> Atomic order-4+ allocations: 0
> 
> Patched
> HighAtomic pageblocks per host: 1
> THP success rate: 44-78%
> Compaction success rate: 24-47%
> pgscan_kswapd (total across ~60 hosts, per minute): ~29.9M
> Atomic order-4+ allocations: 0
> 
> Note that for this workload all atomic allocations were order 0-3
> originating from the network stack, btrfs, and scheduler.
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3446,6 +3446,13 @@ static void reserve_highatomic_pageblock(struct page *page, int order,
>  	int mt;
>  	unsigned long max_managed;
>  
> +	/*
> +	 * Don't reserve a pageblock for lower orders.
> +	 * Order 1-3 allocs should not capture a huge page size block.
> +	 */
> +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> +		return;
> +
>  	/*
>  	 * The number reserved as: minimum is 1 pageblock, maximum is
>  	 * roughly 1% of a zone. But if 1% of a zone falls below a

Sashiko asked

: Does skipping the HighAtomic reservation for orders 1-3 break the
: anti-fragmentation guarantees for these atomic allocations?
: 
: The MIGRATE_HIGHATOMIC reserve protects high-order atomic allocations
: from failing under fragmentation by taking ownership of the entire
: pageblock.
: 
: If order-1 through order-3 atomic allocations fall back to stealing
: pages, but the pageblock remains in its original migratetype, won't
: order-0 non-atomic allocations consume the remaining contiguous space?
: 
: Under memory pressure, this could leave no contiguous blocks for atomic
: allocations to steal.  Because these atomic allocations cannot trigger
: direct reclaim or compaction, they might fail, potentially leading to
: dropped packets or I/O errors in subsystems like the network stack or
: BTRFS.
: 
: Could background compaction or khugepaged be used to unreserve
: HighAtomic blocks dynamically instead of disabling the reserve for
: these orders?




* Re: [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order
  2026-05-19  1:25 [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order JP Kobryn (Meta)
  2026-05-19 19:27 ` Andrew Morton
@ 2026-05-19 20:28 ` Johannes Weiner
  2026-05-25  9:11   ` Vlastimil Babka (SUSE)
  2026-05-27  2:33   ` JP Kobryn
  2026-05-28 17:09 ` Frank van der Linden
  2 siblings, 2 replies; 11+ messages in thread
From: Johannes Weiner @ 2026-05-19 20:28 UTC (permalink / raw)
  To: JP Kobryn (Meta)
  Cc: akpm, vbabka, surenb, mhocko, jackmanb, ziy, linux-mm, usama.arif,
	kirill, willy, linux-kernel, kernel-team

On Mon, May 18, 2026 at 06:25:32PM -0700, JP Kobryn (Meta) wrote:
> We're seeing a pattern in production where 2MB THP order-9 allocations are
> failing due to fragmentation and triggering reclaim on systems with plenty
> of free memory. Over time, the success rate of these THP allocations do not
> increase at all.
> 
> Inspecting zone->vm_stat[NR_FREE_PAGES] via kprobe on compaction_suitable()
> indicated the given zone had sufficient free pages for order-9 allocations,
> yet they were going unused. Drilling down into the zone and inspecting
> /proc/pagetypeinfo revealed why. Order-9 blocks were accumulating in the
> zone's HighAtomic bucket (while zero were present in Movable). THP is
> unable to draw blocks from HighAtomic since that bucket is not in the
> fallback list.
> 
> The heuristic for reserving pageblocks in HighAtomic is that any atomic
> allocation greater than order-0 will result in the full pageblock being
> captured. This means that an order-1 atomic allocation will over-reserve by
> 256x, a full 512 pageblock.
> 
> Gate the reservation on order. Skip for allocations at or below
> PAGE_ALLOC_COSTLY_ORDER. This prevents smaller atomic allocations from
> reserving entire pageblocks, and significantly helps when THP is in use on
> a fragmented but otherwise healthy system.
> 
> Testing was performed using an A/B instagram workload receiving prod
> traffic. Each side had ~60 hosts with 64G memory. The patch resulted in
> several gains:
> 
> Unpatched
> HighAtomic pageblocks per host: 309-312 (1% of zone or 620MB),
>   ...all order-9 blocks in HighAtomic
> THP success rate: 1-6%
> Compaction success rate: 0-2%
> pgscan_kswapd (total across ~60 hosts, per minute): ~70.2M
> Atomic order-4+ allocations: 0
> 
> Patched
> HighAtomic pageblocks per host: 1
> THP success rate: 44-78%
> Compaction success rate: 24-47%
> pgscan_kswapd (total across ~60 hosts, per minute): ~29.9M
> Atomic order-4+ allocations: 0

This is an interesting patch. A couple of thoughts:

1. You disabled the highatomic reserve for this workload and it didn't
seem to matter. Presumably <costly orders don't need the protection.

2. Maxing out the reserves is odd. ALLOC_HIGHATOMIC allocations will
try reserved space first, and I'd expect things that are commonly
highatomic to be short-lived. Why don't we stop with a couple of
claimed highatomic blocks that get continuously recycled?

3. The impact on THP and compaction success rate is pretty
extreme. How can 1% of memory throw such a wrench into the gears?

Have you tried this with other workloads?



* Re: [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order
  2026-05-19 19:27 ` Andrew Morton
@ 2026-05-19 23:25   ` JP Kobryn (Meta)
  0 siblings, 0 replies; 11+ messages in thread
From: JP Kobryn (Meta) @ 2026-05-19 23:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: vbabka, surenb, mhocko, jackmanb, hannes, ziy, linux-mm,
	usama.arif, kirill, willy, linux-kernel, kernel-team

On 5/19/26 12:27 PM, Andrew Morton wrote:
> On Mon, 18 May 2026 18:25:32 -0700 "JP Kobryn (Meta)" <jp.kobryn@linux.dev> wrote:
> 
>> We're seeing a pattern in production where 2MB THP order-9 allocations are
>> failing due to fragmentation and triggering reclaim on systems with plenty
>> of free memory. Over time, the success rate of these THP allocations do not
>> increase at all.
>>
>> Inspecting zone->vm_stat[NR_FREE_PAGES] via kprobe on compaction_suitable()
>> indicated the given zone had sufficient free pages for order-9 allocations,
>> yet they were going unused. Drilling down into the zone and inspecting
>> /proc/pagetypeinfo revealed why. Order-9 blocks were accumulating in the
>> zone's HighAtomic bucket (while zero were present in Movable). THP is
>> unable to draw blocks from HighAtomic since that bucket is not in the
>> fallback list.
>>
>> The heuristic for reserving pageblocks in HighAtomic is that any atomic
>> allocation greater than order-0 will result in the full pageblock being
>> captured. This means that an order-1 atomic allocation will over-reserve by
>> 256x, a full 512 pageblock.
>>
>> Gate the reservation on order. Skip for allocations at or below
>> PAGE_ALLOC_COSTLY_ORDER. This prevents smaller atomic allocations from
>> reserving entire pageblocks, and significantly helps when THP is in use on
>> a fragmented but otherwise healthy system.
>>
>> Testing was performed using an A/B instagram workload receiving prod
>> traffic. Each side had ~60 hosts with 64G memory. The patch resulted in
>> several gains:
>>
>> Unpatched
>> HighAtomic pageblocks per host: 309-312 (1% of zone or 620MB),
>>    ...all order-9 blocks in HighAtomic
>> THP success rate: 1-6%
>> Compaction success rate: 0-2%
>> pgscan_kswapd (total across ~60 hosts, per minute): ~70.2M
>> Atomic order-4+ allocations: 0
>>
>> Patched
>> HighAtomic pageblocks per host: 1
>> THP success rate: 44-78%
>> Compaction success rate: 24-47%
>> pgscan_kswapd (total across ~60 hosts, per minute): ~29.9M
>> Atomic order-4+ allocations: 0
>>
>> Note that for this workload all atomic allocations were order 0-3
>> originating from the network stack, btrfs, and scheduler.
>>
>> ...
>>
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3446,6 +3446,13 @@ static void reserve_highatomic_pageblock(struct page *page, int order,
>>   	int mt;
>>   	unsigned long max_managed;
>>   
>> +	/*
>> +	 * Don't reserve a pageblock for lower orders.
>> +	 * Order 1-3 allocs should not capture a huge page size block.
>> +	 */
>> +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
>> +		return;
>> +
>>   	/*
>>   	 * The number reserved as: minimum is 1 pageblock, maximum is
>>   	 * roughly 1% of a zone. But if 1% of a zone falls below a
> 
> Sashiko asked
> 
> : Does skipping the HighAtomic reservation for orders 1-3 break the
> : anti-fragmentation guarantees for these atomic allocations?

The data included in the changelog supports the claim that the reserve
does not provide a benefit at these orders. Even on fragmented systems,
orders 1-3 have plenty of pages available.

> :
> : The MIGRATE_HIGHATOMIC reserve protects high-order atomic allocations
> : from failing under fragmentation by taking ownership of the entire
> : pageblock.

In the experiments, there were no failures for orders 1-3 despite a
fragmented system and no reserved pageblocks for these orders.

> :
> : If order-1 through order-3 atomic allocations fall back to stealing
> : pages, but the pageblock remains in its original migratetype, won't
> : order-0 non-atomic allocations consume the remaining contiguous space?

With the patch, these pageblocks stay movable. So if fallback is needed,
movable pages can still be taken. But the patch actually improves
compaction so contiguous space is increased overall.

> :
> : Under memory pressure, this could leave no contiguous blocks for atomic
> : allocations to steal.  Because these atomic allocations cannot trigger
> : direct reclaim or compaction, they might fail, potentially leading to
> : dropped packets or I/O errors in subsystems like the network stack or
> : BTRFS.

Reserved HighAtomic pageblocks are not currently treated as a precious
resource. Under real memory pressure, the kernel already gives them up -
the unreserve mechanism kicks in and converts the HighAtomic pageblocks
back to their original migrate type.

> :
> : Could background compaction or khugepaged be used to unreserve
> : HighAtomic blocks dynamically instead of disabling the reserve for
> : these orders?

This would call for extra scanning/overhead/stats. The patch reduces
reclaim, for example. A new scanner feels like going in the opposite
direction.




* Re: [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order
  2026-05-19 20:28 ` Johannes Weiner
@ 2026-05-25  9:11   ` Vlastimil Babka (SUSE)
  2026-05-27  5:57     ` JP Kobryn
  2026-05-27  2:33   ` JP Kobryn
  1 sibling, 1 reply; 11+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-05-25  9:11 UTC (permalink / raw)
  To: Johannes Weiner, JP Kobryn (Meta)
  Cc: akpm, surenb, mhocko, jackmanb, ziy, linux-mm, usama.arif, kirill,
	willy, linux-kernel, kernel-team

On 5/19/26 22:28, Johannes Weiner wrote:
> On Mon, May 18, 2026 at 06:25:32PM -0700, JP Kobryn (Meta) wrote:
>> We're seeing a pattern in production where 2MB THP order-9 allocations are
>> failing due to fragmentation and triggering reclaim on systems with plenty
>> of free memory. Over time, the success rate of these THP allocations do not
>> increase at all.
>> 
>> Inspecting zone->vm_stat[NR_FREE_PAGES] via kprobe on compaction_suitable()
>> indicated the given zone had sufficient free pages for order-9 allocations,
>> yet they were going unused. Drilling down into the zone and inspecting
>> /proc/pagetypeinfo revealed why. Order-9 blocks were accumulating in the
>> zone's HighAtomic bucket (while zero were present in Movable). THP is
>> unable to draw blocks from HighAtomic since that bucket is not in the
>> fallback list.
>> 
>> The heuristic for reserving pageblocks in HighAtomic is that any atomic
>> allocation greater than order-0 will result in the full pageblock being
>> captured. This means that an order-1 atomic allocation will over-reserve by
>> 256x, a full 512 pageblock.
>> 
>> Gate the reservation on order. Skip for allocations at or below
>> PAGE_ALLOC_COSTLY_ORDER. This prevents smaller atomic allocations from
>> reserving entire pageblocks, and significantly helps when THP is in use on
>> a fragmented but otherwise healthy system.
>> 
>> Testing was performed using an A/B instagram workload receiving prod
>> traffic. Each side had ~60 hosts with 64G memory. The patch resulted in
>> several gains:
>> 
>> Unpatched
>> HighAtomic pageblocks per host: 309-312 (1% of zone or 620MB),
>>   ...all order-9 blocks in HighAtomic
>> THP success rate: 1-6%
>> Compaction success rate: 0-2%
>> pgscan_kswapd (total across ~60 hosts, per minute): ~70.2M
>> Atomic order-4+ allocations: 0
>> 
>> Patched
>> HighAtomic pageblocks per host: 1
>> THP success rate: 44-78%
>> Compaction success rate: 24-47%
>> pgscan_kswapd (total across ~60 hosts, per minute): ~29.9M
>> Atomic order-4+ allocations: 0
> 
> This is an interesting patch. A couple of thoughts:
> 
> 1. You disabled the highatomic reserve for this workload and it didn't
> seem to matter. Presumably <costly orders don't need the protection.
> 
> 2. Maxing out the reserves is odd. ALLOC_HIGHATOMIC allocations will
> try reserved space first,

Hmm, but if the allocation succeeds before entering slowpath,
ALLOC_NON_BLOCK won't be set.
But reserving another block should mean we already exhausted the reserved ones.
Unreserving is only done when direct reclaim made some progress but failed
to produce a page. But if it works, or kswapd does the job, we won't enter it?

> and I'd expect things that are commonly
> highatomic to be short-lived. Why don't we stop with a couple of
> claimed highatomic blocks that get continuously recycled?

Maybe it's some big burst of highatomic allocations that leads to the
reservations and then they stay around "forever"?

If that's the case I think we should be perhaps looking at the unreserving
being done more proactively, rather than limiting things to costly order.

> 3. The impact on THP and compaction success rate is pretty
> extreme. How can 1% of memory throw such a wrench into the gears?

Maybe if ~all free memory is in the highatomic blocks, compaction can't be
very effective. Or some suitability check somewhere in reclaim+compaction
wrongly assumes the highatomic blocks are usable, so it won't do the work.

> Have you tried this with other workloads?




* Re: [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order
  2026-05-19 20:28 ` Johannes Weiner
  2026-05-25  9:11   ` Vlastimil Babka (SUSE)
@ 2026-05-27  2:33   ` JP Kobryn
  1 sibling, 0 replies; 11+ messages in thread
From: JP Kobryn @ 2026-05-27  2:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: akpm, vbabka, surenb, mhocko, jackmanb, ziy, linux-mm, usama.arif,
	kirill, willy, linux-kernel, kernel-team

On 5/19/26 1:28 PM, Johannes Weiner wrote:
> On Mon, May 18, 2026 at 06:25:32PM -0700, JP Kobryn (Meta) wrote:
>> We're seeing a pattern in production where 2MB THP order-9 allocations are
>> failing due to fragmentation and triggering reclaim on systems with plenty
>> of free memory. Over time, the success rate of these THP allocations do not
>> increase at all.
>>
>> Inspecting zone->vm_stat[NR_FREE_PAGES] via kprobe on compaction_suitable()
>> indicated the given zone had sufficient free pages for order-9 allocations,
>> yet they were going unused. Drilling down into the zone and inspecting
>> /proc/pagetypeinfo revealed why. Order-9 blocks were accumulating in the
>> zone's HighAtomic bucket (while zero were present in Movable). THP is
>> unable to draw blocks from HighAtomic since that bucket is not in the
>> fallback list.
>>
>> The heuristic for reserving pageblocks in HighAtomic is that any atomic
>> allocation greater than order-0 will result in the full pageblock being
>> captured. This means that an order-1 atomic allocation will over-reserve by
>> 256x, a full 512 pageblock.
>>
>> Gate the reservation on order. Skip for allocations at or below
>> PAGE_ALLOC_COSTLY_ORDER. This prevents smaller atomic allocations from
>> reserving entire pageblocks, and significantly helps when THP is in use on
>> a fragmented but otherwise healthy system.
>>
>> Testing was performed using an A/B instagram workload receiving prod
>> traffic. Each side had ~60 hosts with 64G memory. The patch resulted in
>> several gains:
>>
>> Unpatched
>> HighAtomic pageblocks per host: 309-312 (1% of zone or 620MB),
>> ...all order-9 blocks in HighAtomic
>> THP success rate: 1-6%
>> Compaction success rate: 0-2%
>> pgscan_kswapd (total across ~60 hosts, per minute): ~70.2M
>> Atomic order-4+ allocations: 0
>>
>> Patched
>> HighAtomic pageblocks per host: 1
>> THP success rate: 44-78%
>> Compaction success rate: 24-47%
>> pgscan_kswapd (total across ~60 hosts, per minute): ~29.9M
>> Atomic order-4+ allocations: 0
>
> This is an interesting patch. A couple of thoughts:
>
> 1. You disabled the highatomic reserve for this workload and it didn't
> seem to matter. Presumably <costly orders don't need the protection.

Right. Although one detail I realize is that I should also consider
pageblock_order to avoid any config issue.
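
One possible reading of that remark, sketched as a standalone predicate: skip the reservation only when the request is both at or below the costly order and strictly smaller than a pageblock, so that a configuration with a small pageblock_order does not lose the reserve entirely. This is an assumption about the intent, not code from the posted patch; the function and macro names are hypothetical.

```c
#include <stdbool.h>

/* PAGE_ALLOC_COSTLY_ORDER is 3 in the kernel; mirrored here for the sketch. */
#define EXAMPLE_COSTLY_ORDER 3

/*
 * Hypothetical gate: return true when the highatomic reservation should
 * be skipped. The second condition keeps the reserve working when
 * pageblock_order is configured at or below the costly order.
 */
static bool skip_highatomic_reserve(unsigned int order,
				    unsigned int pageblock_order)
{
	return order <= EXAMPLE_COSTLY_ORDER && order < pageblock_order;
}
```

With a pageblock order of 9 this behaves like the posted check for orders 1-3, while a request whose order equals the pageblock order would still reserve.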

> 2. Maxing out the reserves is odd. ALLOC_HIGHATOMIC allocations will
> try reserved space first, and I'd expect things that are commonly
> highatomic to be short-lived. Why don't we stop with a couple of
> claimed highatomic blocks that get continuously recycled?

Even though they may be short-lived, the data shows the volume of
allocations is steady enough to keep the reserves maxed out.

> 3. The impact on THP and compaction success rate is pretty
> extreme. How can 1% of memory throw such a wrench into the gears?

Looking at the pre-patched high atomic pageblock counts, that's ~300
pageblocks that could've been used for THPs. They become usable after
the patch.

> Have you tried this with other workloads?

No, but the pre-patch symptoms will show up on workloads where net
allocs are frequent enough to keep the high atomic pageblock count up.
Memory size of hosts involved is a factor as well since it's possible for a
majority of order-9 pages to be stuck in high atomic.




* Re: [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order
  2026-05-25  9:11   ` Vlastimil Babka (SUSE)
@ 2026-05-27  5:57     ` JP Kobryn
  2026-05-28 13:57       ` Vlastimil Babka (SUSE)
  0 siblings, 1 reply; 11+ messages in thread
From: JP Kobryn @ 2026-05-27  5:57 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), Johannes Weiner
  Cc: akpm, surenb, mhocko, jackmanb, ziy, linux-mm, usama.arif, kirill,
	willy, linux-kernel, kernel-team

On 5/25/26 2:11 AM, Vlastimil Babka (SUSE) wrote:
> On 5/19/26 22:28, Johannes Weiner wrote:
>> On Mon, May 18, 2026 at 06:25:32PM -0700, JP Kobryn (Meta) wrote:
>>> We're seeing a pattern in production where 2MB THP order-9 allocations are
>>> failing due to fragmentation and triggering reclaim on systems with plenty
>>> of free memory. Over time, the success rate of these THP allocations do not
>>> increase at all.
>>>
>>> Inspecting zone->vm_stat[NR_FREE_PAGES] via kprobe on compaction_suitable()
>>> indicated the given zone had sufficient free pages for order-9 allocations,
>>> yet they were going unused. Drilling down into the zone and inspecting
>>> /proc/pagetypeinfo revealed why. Order-9 blocks were accumulating in the
>>> zone's HighAtomic bucket (while zero were present in Movable). THP is
>>> unable to draw blocks from HighAtomic since that bucket is not in the
>>> fallback list.
>>>
>>> The heuristic for reserving pageblocks in HighAtomic is that any atomic
>>> allocation greater than order-0 will result in the full pageblock being
>>> captured. This means that an order-1 atomic allocation will over-reserve by
>>> 256x, a full 512 pageblock.
>>>
>>> Gate the reservation on order. Skip for allocations at or below
>>> PAGE_ALLOC_COSTLY_ORDER. This prevents smaller atomic allocations from
>>> reserving entire pageblocks, and significantly helps when THP is in use on
>>> a fragmented but otherwise healthy system.
>>>
>>> Testing was performed using an A/B instagram workload receiving prod
>>> traffic. Each side had ~60 hosts with 64G memory. The patch resulted in
>>> several gains:
>>>
>>> Unpatched
>>> HighAtomic pageblocks per host: 309-312 (1% of zone or 620MB),
>>> ...all order-9 blocks in HighAtomic
>>> THP success rate: 1-6%
>>> Compaction success rate: 0-2%
>>> pgscan_kswapd (total across ~60 hosts, per minute): ~70.2M
>>> Atomic order-4+ allocations: 0
>>>
>>> Patched
>>> HighAtomic pageblocks per host: 1
>>> THP success rate: 44-78%
>>> Compaction success rate: 24-47%
>>> pgscan_kswapd (total across ~60 hosts, per minute): ~29.9M
>>> Atomic order-4+ allocations: 0
>> This is an interesting patch. A couple of thoughts:
>>
>> 1. You disabled the highatomic reserve for this workload and it didn't
>> seem to matter. Presumably <costly orders don't need the protection.
>>
>> 2. Maxing out the reserves is odd. ALLOC_HIGHATOMIC allocations will
>> try reserved space first,
>
> Hmm, but if the allocation succeeds before entering slowpath,
> ALLOC_NON_BLOCK won't be set.
> But reserving another block should mean we already exhausted the reserved ones.
> Unreserving is only done when direct reclaim made some progress but failed
> to produce a page. But if it works, or kswapd does the job, we won't enter it?

There was just no real pressure to invoke the unreserving. Let me know
if I'm misunderstanding the question.

>> and I'd expect things that are commonly
>> highatomic to be short-lived. Why don't we stop with a couple of
>> claimed highatomic blocks that get continuously recycled?
> Maybe it's some big burst of highatomic allocations that leads to the
> reservations and then they stay around "forever"?

I should add to the changelog the missing info that high frequency
net allocations are responsible for these high atomic reservations.
Even though the allocations are not necessarily long-lived, the
pageblocks remain high atomic.

> If that's the case I think we should be perhaps looking at the unreserving
> being done more proactively, rather than limiting things to costly order.

What are your thoughts if we instead look at it as: should we be reserving
full pageblocks for small allocations?

It seems to come down to whether we want the disproportionate protection of
full pageblocks (below costly order) for high atomic allocs vs letting them
coalesce in the buddy path. Is the data not enough to justify the latter?

>> 3. The impact on THP and compaction success rate is pretty
>> extreme. How can 1% of memory throw such a wrench into the gears?
> Maybe if ~all free memory is in the highatomic blocks, compaction can't be
> effective much. Or some suitability check somewhere in reclaim+compaction
> wrongly assumes the highatomic blocks are usable, so it won't do the work.

I could be missing something, but I spent some time tonight looking into
this and didn't find an issue in the compaction/reclaim suitability path.

__compaction_suitable() calls __zone_watermark_ok(), and that path
subtracts free MIGRATE_HIGHATOMIC pages from usable free memory for
callers without reserve access:

  /*
   * If the caller does not have rights to reserves below the min
   * watermark then subtract the free pages reserved for highatomic.
   */
  if (likely(!(alloc_flags & ALLOC_RESERVES)))
      unusable_free += READ_ONCE(z->nr_free_highatomic);

So free highatomic pages are removed from the usable free count there.

Also, the suitable-free-block check in __zone_watermark_ok() only treats
MIGRATE_HIGHATOMIC as usable when alloc_flags includes
ALLOC_HIGHATOMIC (or ALLOC_OOM). __compaction_suitable() passes
ALLOC_CMA here (not ALLOC_HIGHATOMIC), so I don't think compaction is
incorrectly treating free highatomic blocks as usable.
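
As a rough userspace model of the accounting described above (not the kernel's code: the struct and function are hypothetical and only mirror the quoted highatomic subtraction, leaving out the rest of __zone_watermark_ok()):

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two counters discussed; not struct zone. */
struct zone_model {
	long free_pages;
	long nr_free_highatomic;
};

/*
 * Free pages a caller may count on. Callers without reserve access
 * have the free highatomic pages subtracted, as in the quoted snippet.
 */
static long usable_free_pages(const struct zone_model *z,
			      bool has_reserve_access)
{
	long unusable_free = 0;

	if (!has_reserve_access)
		unusable_free += z->nr_free_highatomic;

	return z->free_pages - unusable_free;
}
```

For a caller without reserve access, such as the compaction suitability check, a zone whose free pages sit mostly in highatomic pageblocks looks nearly empty in this model, which matches the reading that the check itself is not overcounting.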

The only caveat I noticed is the fragmentation accounting side:
fill_contig_page_info() / fragmentation_index() appear to count
free_area[order].nr_free across migratetypes, so fragmentation scoring
may look better than it really is. But that seems adjacent
to this patch.

I think though that by the time we consider reclaim or compaction we're
dealing with the aftermath. The patch prevents the problem from occurring
up front.


* Re: [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order
  2026-05-27  5:57     ` JP Kobryn
@ 2026-05-28 13:57       ` Vlastimil Babka (SUSE)
  2026-06-16 19:58         ` JP Kobryn
  0 siblings, 1 reply; 11+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-05-28 13:57 UTC (permalink / raw)
  To: JP Kobryn, Johannes Weiner, Mel Gorman
  Cc: akpm, surenb, mhocko, jackmanb, ziy, linux-mm, usama.arif, kirill,
	willy, linux-kernel, kernel-team

On 5/27/26 07:57, JP Kobryn wrote:
> On 5/25/26 2:11 AM, Vlastimil Babka (SUSE) wrote:
>> On 5/19/26 22:28, Johannes Weiner wrote:
>>> On Mon, May 18, 2026 at 06:25:32PM -0700, JP Kobryn (Meta) wrote:
>>> This is an interesting patch. A couple of thoughts:
>>>
>>> 1. You disabled the highatomic reserve for this workload and it didn't
>>> seem to matter. Presumably <costly orders don't need the protection.
>>>
>>> 2. Maxing out the reserves is odd. ALLOC_HIGHATOMIC allocations will
>>> try reserved space first,
>> Hmm, but if the allocation succeeds before entering slowpath,
>> ALLOC_NON_BLOCK won't be set.
>> But reserving another block should mean we already exhausted the reserved ones.
>> Unreserving is only done when direct reclaim made some progress but failed
>> to produce a page. But if it works, or kswapd does the job, we won't enter it?
> 
> There was just no real pressure to invoke the unreserving. Let me know
> if I'm misunderstanding the question.

Sorry, it was more thinking out loud about Johannes' point than a question.
Yeah it seems there was no real pressure to invoke unreserving.

The reserving side is probably fine. Highatomic allocation will not try the
already reserved blocks in the fastpath, which is maybe not ideal. But they
will try them before reserving another block, and that's the important part.

>>> and I'd expect things that are commonly
>>> highatomic to be short-lived. Why don't we stop with a couple of
>>> claimed highatomic blocks that get continuously recycled?
>> Maybe it's some big burst of highatomic allocations that leads to the
>> reservations and then they stay around "forever"?
> 
> I should add to the changelog the missing info that high frequency
> net allocations are responsible for these high atomic reservations.
> Even though the allocations are not necessarily long-lived, the
> pageblocks remain high atomic.

OK, thanks for the info.

>> If that's the case I think we should be perhaps looking at the unreserving
>> being done more proactively, rather than limiting things to costly order.
> 
> What are your thoughts if we instead look at it as: should we be reserving
> full pageblocks for small allocations?

Well, since migratetypes operate on the pageblock level, so do the
highatomic reservations. That at least groups them together rather than
scattering them all over random pageblocks?

> It seems to come down to whether we want the disproportionate protection 
> of full
> pageblocks (below costly order) for high atomic allocs vs letting them 
> coalesce
> in the buddy path. Is the data not enough to justify the latter?

I still think the data shows we might be too lax in unreserving.

>>> 3. The impact on THP and compaction success rate is pretty
>>> extreme. How can 1% of memory throw such a wrench into the gears?
>> Maybe if ~all free memory is in the highatomic blocks, compaction can't be
>> effective much. Or some suitability check somewhere in reclaim+compaction
>> wrongly assumes the highatomic blocks are usable, so it won't do the work.
> 
> I could be missing something, but I spent some time tonight looking into
> this and didn't find an issue in the compaction/reclaim suitability path.
> 
> __compaction_suitable() calls __zone_watermark_ok(), and that path
> subtracts free MIGRATE_HIGHATOMIC pages from usable free memory for
> callers without reserve access:
> 
>   /*
>    * If the caller does not have rights to reserves below the min
>    * watermark then subtract the free pages reserved for highatomic.
>    */
>   if (likely(!(alloc_flags & ALLOC_RESERVES)))
>       unusable_free += READ_ONCE(z->nr_free_highatomic);
> 
> So free highatomic pages are removed from the usable free count there.
> 
> Also, the suitable-free-block check in __zone_watermark_ok() only treats
> MIGRATE_HIGHATOMIC as usable when alloc_flags includes
> ALLOC_HIGHATOMIC (or ALLOC_OOM). __compaction_suitable() passes
> ALLOC_CMA here (not ALLOC_HIGHATOMIC), so I don't think compaction is
> incorrectly treating free highatomic blocks as usable.

OK, thanks for checking.
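
To make the accounting described above concrete, here is a minimal
userspace sketch, not the kernel implementation. The toy_* names and the
TOY_ALLOC_RESERVES flag are hypothetical stand-ins, and the model leaves
out lowmem_reserve, CMA and the per-order free list scan that the real
__zone_watermark_ok() performs:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for ALLOC_RESERVES; not the kernel's value. */
#define TOY_ALLOC_RESERVES 0x1u

/* Toy zone: only the two counters this discussion cares about. */
struct toy_zone {
	long free_pages;         /* models the zone's NR_FREE_PAGES count */
	long nr_free_highatomic; /* free pages in highatomic pageblocks */
};

/* Free pages the caller may actually count on for this request. */
static long toy_usable_free(const struct toy_zone *z, unsigned int order,
			    unsigned int alloc_flags)
{
	long unusable_free;

	assert(order < 8 * sizeof(long) - 1);
	unusable_free = (1L << order) - 1;

	/* Callers without reserve access cannot use highatomic free pages. */
	if (!(alloc_flags & TOY_ALLOC_RESERVES))
		unusable_free += z->nr_free_highatomic;

	return z->free_pages - unusable_free;
}

/* Simplified watermark check: usable free pages must exceed the mark. */
static bool toy_watermark_ok(const struct toy_zone *z, unsigned int order,
			     long mark, unsigned int alloc_flags)
{
	return toy_usable_free(z, order, alloc_flags) > mark;
}
```

In this simplified model, a zone whose free pages mostly sit in highatomic
pageblocks would fail the check for a caller without reserve access even
though the raw free page count looks healthy.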

> The only caveat I noticed is the fragmentation accounting side:
> fill_contig_page_info() / fragmentation_index() appear to count
> free_area[order].nr_free across migratetypes, so fragmentation scoring
> may look better than they really are. But that seems adjacent
> to this patch.

Right.
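
A rough userspace sketch of that caveat, under the assumption that the
scoring follows the shape of fill_contig_page_info() and
__fragmentation_index() as I read them in mm/vmstat.c. The toy_* names,
the per-order highatomic counter and the skip_highatomic switch are
hypothetical and exist only to show the difference:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_ORDERS 11 /* assumption: orders 0..10 */

struct toy_free_area {
	unsigned long nr_free;            /* free blocks, all migratetypes */
	unsigned long nr_free_highatomic; /* subset in highatomic pageblocks */
};

/*
 * Returns -1000 if a free block of the requested order exists, 0 if
 * nothing is free, otherwise an index in 0..1000 where values near 1000
 * mean a failure would be due to fragmentation rather than lack of memory.
 */
static int toy_fragmentation_index(const struct toy_free_area *area,
				   unsigned int order, bool skip_highatomic)
{
	unsigned long long free_pages = 0, blocks_total = 0;
	unsigned long long blocks_suitable = 0;
	unsigned long long requested;
	unsigned int o;

	assert(order < TOY_NR_ORDERS);
	requested = 1ULL << order;

	for (o = 0; o < TOY_NR_ORDERS; o++) {
		unsigned long long blocks = area[o].nr_free;

		assert(area[o].nr_free_highatomic <= area[o].nr_free);
		if (skip_highatomic)
			blocks -= area[o].nr_free_highatomic;

		blocks_total += blocks;
		free_pages += blocks << o;
		if (o >= order)
			blocks_suitable += blocks << (o - order);
	}

	if (!blocks_total)
		return 0;
	if (blocks_suitable)
		return -1000;

	return (int)(1000 - (1000 + free_pages * 1000 / requested) / blocks_total);
}
```

When every free block of the requested order sits in highatomic
pageblocks, counting across all migratetypes reports that the request
would succeed, while excluding them reports heavy fragmentation.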

> I think though that by the time we consider reclaim or compaction we're
> dealing with the aftermath. The patch prevents the problem from occurring
> up front.

But I think as a result the highatomic feature is effectively dead. Your
results confirm there are no more Highatomic pageblocks and zero Atomic
order-4+ allocations (actually it's weird there's still 1 highatomic
pageblock with zero allocations that would reserve it, or is that a rounding
error due to calculating average across multiple hosts?).

I think it's not a surprise that there are no costly highatomic allocation
attempts, we've always said they are too easy to fail, so likely nobody even
tries them. MIGRATE_HIGHATOMIC was introduced by Mel [1] and evaluated on
order-1. Even the non-costly orders can fail of course and should have
fallbacks, highatomic reserves are just supposed to make the success more
likely as that improves e.g. the networking receive performance, and they do
use non-costly orders.

Did you observe no increase in net receive fallbacks due to this patch?
Would that be a universal outcome? I.e. did highatomic reservations become
obsolete thanks to other improvements to the page allocator since they were
introduced? That would be great as we could remove it completely and
simplify the code, but we don't know that yet.

If there are still benefits, they probably should stay, but that means
keeping them working for non-costly orders, and we should fix the observed
problems differently. I can see two directions to try, in this order.

- You say there are "high frequency net allocations" so I assume they are
ongoing. We could try modifying the fastpath __alloc_frozen_pages_noprof() to
properly evaluate ALLOC_HIGHATOMIC and let them prefer the reserved blocks
in cases that do not end up in __alloc_pages_slowpath(). This should ensure
the reserved blocks are actually being used even if we are above low
watermarks and don't enter the slowpath.

- If that doesn't help and we still have unused highatomic pageblocks,
figure out how that happens - is the highatomic allocation frequency higher
at some point, resulting in their increase, and then it drops and they stay
around? If yes, think about how to make the unreserving more aggressive than
it currently is.

[1]
https://lore.kernel.org/all/1442832762-7247-10-git-send-email-mgorman@techsingularity.net/




* Re: [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order
  2026-05-19  1:25 [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order JP Kobryn (Meta)
  2026-05-19 19:27 ` Andrew Morton
  2026-05-19 20:28 ` Johannes Weiner
@ 2026-05-28 17:09 ` Frank van der Linden
  2026-06-16 20:00   ` JP Kobryn
  2 siblings, 1 reply; 11+ messages in thread
From: Frank van der Linden @ 2026-05-28 17:09 UTC (permalink / raw)
  To: JP Kobryn (Meta)
  Cc: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy, linux-mm,
	usama.arif, kirill, willy, linux-kernel, kernel-team

On Mon, May 18, 2026 at 6:25 PM JP Kobryn (Meta) <jp.kobryn@linux.dev> wrote:
>
> We're seeing a pattern in production where 2MB THP order-9 allocations are
> failing due to fragmentation and triggering reclaim on systems with plenty
> of free memory. Over time, the success rate of these THP allocations do not
> increase at all.
>
> Inspecting zone->vm_stat[NR_FREE_PAGES] via kprobe on compaction_suitable()
> indicated the given zone had sufficient free pages for order-9 allocations,
> yet they were going unused. Drilling down into the zone and inspecting
> /proc/pagetypeinfo revealed why. Order-9 blocks were accumulating in the
> zone's HighAtomic bucket (while zero were present in Movable). THP is
> unable to draw blocks from HighAtomic since that bucket is not in the
> fallback list.
>
> The heuristic for reserving pageblocks in HighAtomic is that any atomic
> allocation greater than order-0 will result in the full pageblock being
> captured. This means that an order-1 atomic allocation will over-reserve by
> 256x, a full 512 pageblock.
>
> Gate the reservation on order. Skip for allocations at or below
> PAGE_ALLOC_COSTLY_ORDER. This prevents smaller atomic allocations from
> reserving entire pageblocks, and significantly helps when THP is in use on
> a fragmented but otherwise healthy system.
>
> Testing was performed using an A/B instagram workload receiving prod
> traffic. Each side had ~60 hosts with 64G memory. The patch resulted in
> several gains:
>
> Unpatched
> HighAtomic pageblocks per host: 309-312 (1% of zone or 620MB),
>   ...all order-9 blocks in HighAtomic
> THP success rate: 1-6%
> Compaction success rate: 0-2%
> pgscan_kswapd (total across ~60 hosts, per minute): ~70.2M
> Atomic order-4+ allocations: 0
>
> Patched
> HighAtomic pageblocks per host: 1
> THP success rate: 44-78%
> Compaction success rate: 24-47%
> pgscan_kswapd (total across ~60 hosts, per minute): ~29.9M
> Atomic order-4+ allocations: 0
>
> Note that for this workload all atomic allocations were order 0-3
> originating from the network stack, btrfs, and scheduler.
>
> Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>

Was this issue reproduced with a tree that does not have your patch,
but includes b480cbb07102 ("mm/page_alloc: don't increase highatomic
reserve after pcp alloc") ? The symptoms here seem the same.
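
As a sanity check on the sizes in the quoted changelog, the arithmetic can
be sketched as below. This assumes 4 KiB base pages and order-9 (2 MiB)
pageblocks; the toy_* helpers are hypothetical and not kernel functions:

```c
#include <assert.h>

#define TOY_PAGE_SIZE_BYTES 4096UL /* assumption: 4 KiB base pages */
#define TOY_PAGEBLOCK_ORDER 9U     /* assumption: 2 MiB pageblocks */

/* Number of base pages in a block of the given order. */
static unsigned long toy_pages_for_order(unsigned int order)
{
	assert(order < 8 * sizeof(unsigned long));
	return 1UL << order;
}

/* How many times larger a full pageblock is than the requested block. */
static unsigned long toy_over_reserve_factor(unsigned int order)
{
	assert(order <= TOY_PAGEBLOCK_ORDER);
	return toy_pages_for_order(TOY_PAGEBLOCK_ORDER) /
	       toy_pages_for_order(order);
}

/* Memory covered by a number of pageblocks, in MiB. */
static unsigned long toy_pageblocks_to_mib(unsigned long nr_pageblocks)
{
	unsigned long bytes = nr_pageblocks *
			      toy_pages_for_order(TOY_PAGEBLOCK_ORDER) *
			      TOY_PAGE_SIZE_BYTES;

	return bytes >> 20;
}
```

Under these assumptions an order-1 request (2 pages) that captures a
512-page pageblock is over-reserved by a factor of 256, and 309-312
reserved pageblocks cover roughly 618-624 MiB per host.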

- Frank



* Re: [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order
  2026-05-28 13:57       ` Vlastimil Babka (SUSE)
@ 2026-06-16 19:58         ` JP Kobryn
  0 siblings, 0 replies; 11+ messages in thread
From: JP Kobryn @ 2026-06-16 19:58 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), Johannes Weiner, Mel Gorman
  Cc: akpm, surenb, mhocko, jackmanb, ziy, linux-mm, usama.arif, kirill,
	willy, linux-kernel, kernel-team

On 5/28/26 6:57 AM, Vlastimil Babka (SUSE) wrote:
> On 5/27/26 07:57, JP Kobryn wrote:
>> On 5/25/26 2:11 AM, Vlastimil Babka (SUSE) wrote:
>>> On 5/19/26 22:28, Johannes Weiner wrote:
>>>> On Mon, May 18, 2026 at 06:25:32PM -0700, JP Kobryn (Meta) wrote:
>>>> This is an interesting patch. A couple of thoughts:
>>>>
>>>> 1. You disabled the highatomic reserve for this workload and it didn't
>>>> seem to matter. Presumably <costly orders don't need the protection.
>>>>
>>>> 2. Maxing out the reserves is odd. ALLOC_HIGHATOMIC allocations will
>>>> try reserved space first,
>>> Hmm, but if the allocation succeeds before entering slowpath,
>>> ALLOC_NON_BLOCK won't be set.
>>> But reserving another block should mean we already exhausted the 
>>> reserved ones.
>>> Unreserving is only done when direct reclaim made some progress but failed
>>> to produce a page. But if it works, or kswapd does the job, we won't 
>>> enter it?
>>
>> There was just no real pressure to invoke the unreserving. Let me know
>> if I'm misunderstanding the question.
> 
> Sorry, it was more thinking out loud about Johannes' point than a question.
> Yeah it seems there was no real pressure to invoke unreserving.
> 
> The reserving side is probably fine. Highatomic allocations will not try the
> already reserved blocks in the fastpath, which is maybe not ideal. But they
> will try them before reserving another block, and that's the important part.
> 

I sent a patch [0] that addresses this.

>>>> and I'd expect things that are commonly
>>>> highatomic to be short-lived. Why don't we stop with a couple of
>>>> claimed highatomic blocks that get continuously recycled?
>>> Maybe it's some big burst of highatomic allocations that leads to the
>>> reservations and then they stay around "forever"?
>>
>> I should add to the changelog the missing info that high frequency
>> net allocations are responsible for these high atomic reservations.
>> Even though the allocations are not necessarily long-lived, the
>> pageblocks remain high atomic.
> 
> OK, thanks for the info.
> 
>>> If that's the case I think we should be perhaps looking at the unreserving
>>> being done more proactively, rather than limiting things to costly order.
>>
>> What are your thoughts if we instead look at it as: should we be reserving
>> full pageblocks for small allocations?
> 
> Well, since migratetypes operate on the pageblock level, so do the
> highatomic reservations. That at least groups them together rather than
> scattering them all over random pageblocks?

Right, that's the trade-off. I'm not going to pursue this approach.
Instead, I've been looking for a more targeted fix in the relevant
allocator paths. See this patch [0].

> 
>> It seems to come down to whether we want the disproportionate protection 
>> of full
>> pageblocks (below costly order) for high atomic allocs vs letting them 
>> coalesce
>> in the buddy path. Is the data not enough to justify the latter?
> 
> I still think the data shows we might be too lax in unreserving.

Ack.

> 
>>>> 3. The impact on THP and compaction success rate is pretty
>>>> extreme. How can 1% of memory throw such a wrench into the gears?
>>> Maybe if ~all free memory is in the highatomic blocks, compaction can't be
>>> effective much. Or some suitability check somewhere in reclaim+compaction
>>> wrongly assumes the highatomic blocks are usable, so it won't do the work.
>>
>> I could be missing something, but I spent some time tonight looking into
>> this and didn't find an issue in the compaction/reclaim suitability path.
>>
>> __compaction_suitable() calls __zone_watermark_ok(), and that path
>> subtracts free MIGRATE_HIGHATOMIC pages from usable free memory for
>> callers without reserve access:
>>
>>   /*
>>    * If the caller does not have rights to reserves below the min
>>    * watermark then subtract the free pages reserved for highatomic.
>>    */
>>   if (likely(!(alloc_flags & ALLOC_RESERVES)))
>>       unusable_free += READ_ONCE(z->nr_free_highatomic);
>>
>> So free highatomic pages are removed from the usable free count there.
>>
>> Also, the suitable-free-block check in __zone_watermark_ok() only treats
>> MIGRATE_HIGHATOMIC as usable when alloc_flags includes
>> ALLOC_HIGHATOMIC (or ALLOC_OOM). __compaction_suitable() passes
>> ALLOC_CMA here (not ALLOC_HIGHATOMIC), so I don't think compaction is
>> incorrectly treating free highatomic blocks as usable.
> 
> OK, thanks for checking.
> 
>> The only caveat I noticed is the fragmentation accounting side:
>> fill_contig_page_info() / fragmentation_index() appear to count
>> free_area[order].nr_free across migratetypes, so fragmentation scoring
>> may look better than they really are. But that seems adjacent
>> to this patch.
> 
> Right.
> 
>> I think though that by the time we consider reclaim or compaction we're
>> dealing with the aftermath. The patch prevents the problem from occurring
>> up front.
> 
> But I think as a result the highatomic feature is effectively dead. Your
> results confirm there are no more Highatomic pageblocks and zero Atomic
> order-4+ allocations (actually it's weird there's still 1 highatomic
> pageblock with zero allocations that would reserve it, or is that a rounding
> error due to calculating average across multiple hosts?).

Likely a rounding issue.

> 
> I think it's not a surprise that there are no costly highatomic allocation
> attempts, we've always said they are too easy to fail, so likely nobody even
> tries them. MIGRATE_HIGHATOMIC was introduced by Mel [1] and evaluated on
> order-1. Even the non-costly orders can fail of course and should have
> fallbacks, highatomic reserves are just supposed to make the success more
> likely as that improves e.g. the networking receive performance, and they do
> use non-costly orders.
> 
> Did you observe no increase in net receive fallbacks due to this patch?
> Would that be a universal outcome? I.e. did highatomic reservations become
> obsolete thanks to other improvements to the page allocator since they were
> introduced? That would be great as we could remove it completely and
> simplify the code, but we don't know that yet.

See the separate patch [0] which takes a targeted approach on the
allocator path. It accounts for net fallbacks and should help napi/page
frag allocs in the fastpath.

> 
> If there are still benefits, they probably should stay, but that means keep
> them working for non-costly orders, and we should fix the observed problems
> differently. I can see two directions to try in that order.

Ack.

> 
> - You say there are "high frequency net allocations" so I assume they are
> ongoing. We could try modifying the fastpath __alloc_frozen_pages_noprof() to
> properly evaluate ALLOC_HIGHATOMIC and let them prefer the reserved blocks
> in cases that do not end up in __alloc_pages_slowpath(). This should ensure
> the reserved blocks are actually being used even if we are above low
> watermarks and don't enter the slowpath.

Yes, this can be seen in the separate patch [0].

> 
> - If that doesn't help and we still have unused highatomic pageblocks,
> figure out how that happens - is the highatomic allocation frequency higher
> at some point, resulting in their increase, and then it drops and they stay
> around? If yes, think about how to make the unreserving more aggressive than
> it currently is.
> 
> [1]
> https://lore.kernel.org/all/1442832762-7247-10-git-send-email-mgorman@techsingularity.net/
> 

The patch linked below [0] improves the allocator path. I'll also explore
opportunities for unreserving.

[0] https://lore.kernel.org/all/20260616191420.52556-1-jp.kobryn@linux.dev/



* Re: [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order
  2026-05-28 17:09 ` Frank van der Linden
@ 2026-06-16 20:00   ` JP Kobryn
  0 siblings, 0 replies; 11+ messages in thread
From: JP Kobryn @ 2026-06-16 20:00 UTC (permalink / raw)
  To: Frank van der Linden
  Cc: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy, linux-mm,
	usama.arif, kirill, willy, linux-kernel, kernel-team

On 5/28/26 10:09 AM, Frank van der Linden wrote:
> On Mon, May 18, 2026 at 6:25 PM JP Kobryn (Meta) <jp.kobryn@linux.dev> wrote:
>>
>> We're seeing a pattern in production where 2MB THP order-9 allocations are
>> failing due to fragmentation and triggering reclaim on systems with plenty
>> of free memory. Over time, the success rate of these THP allocations do not
>> increase at all.
>>
>> Inspecting zone->vm_stat[NR_FREE_PAGES] via kprobe on compaction_suitable()
>> indicated the given zone had sufficient free pages for order-9 allocations,
>> yet they were going unused. Drilling down into the zone and inspecting
>> /proc/pagetypeinfo revealed why. Order-9 blocks were accumulating in the
>> zone's HighAtomic bucket (while zero were present in Movable). THP is
>> unable to draw blocks from HighAtomic since that bucket is not in the
>> fallback list.
>>
>> The heuristic for reserving pageblocks in HighAtomic is that any atomic
>> allocation greater than order-0 will result in the full pageblock being
>> captured. This means that an order-1 atomic allocation will over-reserve by
>> 256x, a full 512 pageblock.
>>
>> Gate the reservation on order. Skip for allocations at or below
>> PAGE_ALLOC_COSTLY_ORDER. This prevents smaller atomic allocations from
>> reserving entire pageblocks, and significantly helps when THP is in use on
>> a fragmented but otherwise healthy system.
>>
>> Testing was performed using an A/B instagram workload receiving prod
>> traffic. Each side had ~60 hosts with 64G memory. The patch resulted in
>> several gains:
>>
>> Unpatched
>> HighAtomic pageblocks per host: 309-312 (1% of zone or 620MB),
>>   ...all order-9 blocks in HighAtomic
>> THP success rate: 1-6%
>> Compaction success rate: 0-2%
>> pgscan_kswapd (total across ~60 hosts, per minute): ~70.2M
>> Atomic order-4+ allocations: 0
>>
>> Patched
>> HighAtomic pageblocks per host: 1
>> THP success rate: 44-78%
>> Compaction success rate: 24-47%
>> pgscan_kswapd (total across ~60 hosts, per minute): ~29.9M
>> Atomic order-4+ allocations: 0
>>
>> Note that for this workload all atomic allocations were order 0-3
>> originating from the network stack, btrfs, and scheduler.
>>
>> Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
> 
> Was this issue reproduced with a tree that does not have your patch,
> but includes b480cbb07102 ("mm/page_alloc: don't increase highatomic
> reserve after pcp alloc") ? The symptoms here seem the same.
> 

No, it was not, but thanks for sharing this. I could see that commit
helping in a situation like this. See my separate patch [0] for an update
on the buddy side.

[0] https://lore.kernel.org/all/20260616191420.52556-1-jp.kobryn@linux.dev/




Thread overview: 11+ messages
2026-05-19  1:25 [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order JP Kobryn (Meta)
2026-05-19 19:27 ` Andrew Morton
2026-05-19 23:25   ` JP Kobryn (Meta)
2026-05-19 20:28 ` Johannes Weiner
2026-05-25  9:11   ` Vlastimil Babka (SUSE)
2026-05-27  5:57     ` JP Kobryn
2026-05-28 13:57       ` Vlastimil Babka (SUSE)
2026-06-16 19:58         ` JP Kobryn
2026-05-27  2:33   ` JP Kobryn
2026-05-28 17:09 ` Frank van der Linden
2026-06-16 20:00   ` JP Kobryn
