[PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations

* [PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations
@ 2026-05-20 12:22 Dmitry Ilvokhin
  2026-05-21 23:59 ` Andrew Morton
  2026-05-26 13:21 ` Vlastimil Babka (SUSE)
  0 siblings, 2 replies; 9+ messages in thread
From: Dmitry Ilvokhin @ 2026-05-20 12:22 UTC
  To: Andrew Morton, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan
  Cc: linux-mm, linux-kernel, kernel-team, Dmitry Ilvokhin

When defrag_mode is enabled, ALLOC_NOFRAGMENT is enforced to prevent
migratetype fallbacks and keep pageblocks clean. The allocator relies on
reclaim and compaction to free pages of the correct type before allowing
fallback as a last resort.

However, non-reclaimable allocations such as GFP_ATOMIC cannot invoke
direct reclaim or compaction. With defrag_mode=1, these allocations hit
the !can_direct_reclaim bailout in __alloc_pages_slowpath() with
ALLOC_NOFRAGMENT still set, and fail without ever attempting a fallback.

This causes a large number of SLUB allocation failures for
skbuff_head_cache under network-heavy workloads, despite free memory
being available in other migratetype freelists.

Clear ALLOC_NOFRAGMENT and retry for allocations that request kswapd
reclaim but cannot do direct reclaim themselves (GFP_ATOMIC).  Purely
speculative allocations like GFP_TRANSHUGE_LIGHT that don't set
__GFP_KSWAPD_RECLAIM are left to fail, since they have reasonable
fallbacks and should not cause fragmentation.

Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode")

Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
Changes in v2:

- Add check for __GFP_KSWAPD_RECLAIM.
- Pick up Johannes' Acked-by tag.

v1: https://lore.kernel.org/all/20260518163736.173910-1-d@ilvokhin.com/

 mm/page_alloc.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 227d58dc3de6..c5a077de1be0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4811,8 +4811,19 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}

 	/* Caller is not willing to reclaim, we can't balance anything */
-	if (!can_direct_reclaim)
+	if (!can_direct_reclaim) {
+		/*
+		 * Reclaim/compaction cannot run, so defrag_mode's strategy
+		 * of enforcing ALLOC_NOFRAGMENT cannot be fulfilled. Allow
+		 * fallbacks rather than failing the allocation outright.
+		 */
+		if (defrag_mode && (alloc_flags & ALLOC_NOFRAGMENT) &&
+		    (gfp_mask & __GFP_KSWAPD_RECLAIM)) {
+			alloc_flags &= ~ALLOC_NOFRAGMENT;
+			goto retry;
+		}
 		goto nopage;
+	}

 	/* Avoid recursion of direct reclaim */
 	if (current->flags & PF_MEMALLOC)
-- 
2.53.0-Meta
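
The control flow touched by the hunk above can be illustrated with a small
user-space model. This is only a sketch under stated assumptions, not kernel
code: the MODEL_* flags, struct model_zone, model_get_page() and
model_slowpath() are hypothetical stand-ins, the free lists are reduced to two
counters, and the `patched` parameter exists only to compare behavior before
and after the change.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel flags; the values are illustrative. */
#define MODEL_GFP_DIRECT_RECLAIM 0x1u
#define MODEL_GFP_KSWAPD_RECLAIM 0x2u
#define MODEL_ALLOC_NOFRAGMENT   0x1u

enum model_outcome { MODEL_ALLOC_OK, MODEL_ALLOC_FAIL, MODEL_NEEDS_RECLAIM };

/* Toy free lists: pages of the requested migratetype vs. all other types. */
struct model_zone {
	unsigned long free_same_type;
	unsigned long free_other_types;
};

/* With NOFRAGMENT set, only the requested migratetype may be used. */
static bool model_get_page(const struct model_zone *z, unsigned int alloc_flags)
{
	if (z->free_same_type > 0)
		return true;
	if (alloc_flags & MODEL_ALLOC_NOFRAGMENT)
		return false;
	return z->free_other_types > 0;
}

/* Simplified slowpath: one attempt, then the !can_direct_reclaim bailout. */
static enum model_outcome model_slowpath(const struct model_zone *z,
					 unsigned int gfp, bool defrag_mode,
					 bool patched)
{
	unsigned int alloc_flags = defrag_mode ? MODEL_ALLOC_NOFRAGMENT : 0;
	bool can_direct_reclaim = (gfp & MODEL_GFP_DIRECT_RECLAIM) != 0;

retry:
	if (model_get_page(z, alloc_flags))
		return MODEL_ALLOC_OK;

	if (!can_direct_reclaim) {
		/* Patched behavior: drop NOFRAGMENT once and allow fallback. */
		if (patched && defrag_mode &&
		    (alloc_flags & MODEL_ALLOC_NOFRAGMENT) &&
		    (gfp & MODEL_GFP_KSWAPD_RECLAIM)) {
			alloc_flags &= ~MODEL_ALLOC_NOFRAGMENT;
			goto retry;
		}
		return MODEL_ALLOC_FAIL;
	}

	/* The real slowpath would go on to reclaim and compaction here. */
	return MODEL_NEEDS_RECLAIM;
}

/* Exercises the cases discussed in the commit message. */
void model_self_check(void)
{
	struct model_zone z = { .free_same_type = 0, .free_other_types = 100 };

	/* Atomic-style request: fails before the change, falls back after. */
	assert(model_slowpath(&z, MODEL_GFP_KSWAPD_RECLAIM, true, false) ==
	       MODEL_ALLOC_FAIL);
	assert(model_slowpath(&z, MODEL_GFP_KSWAPD_RECLAIM, true, true) ==
	       MODEL_ALLOC_OK);
	/* Speculative request without kswapd reclaim is still left to fail. */
	assert(model_slowpath(&z, 0, true, true) == MODEL_ALLOC_FAIL);
}
```

Calling model_self_check() from a main() runs the three cases. In the model,
clearing the flag cannot loop, because the retry condition requires the flag
to still be set.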

* Re: [PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations
  2026-05-20 12:22 [PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations Dmitry Ilvokhin
@ 2026-05-21 23:59 ` Andrew Morton
  2026-05-22 13:05   ` Dmitry Ilvokhin
  2026-05-26 13:21 ` Vlastimil Babka (SUSE)
  1 sibling, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2026-05-21 23:59 UTC
  To: Dmitry Ilvokhin
  Cc: Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, linux-mm, linux-kernel,
	kernel-team

On Wed, 20 May 2026 12:22:28 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:

> When defrag_mode is enabled, ALLOC_NOFRAGMENT is enforced to prevent
> migratetype fallbacks and keep pageblocks clean. The allocator relies on
> reclaim and compaction to free pages of the correct type before allowing
> fallback as a last resort.
> 
> However, non-reclaimable allocations such as GFP_ATOMIC cannot invoke
> direct reclaim or compaction. With defrag_mode=1, these allocations hit
> the !can_direct_reclaim bailout in __alloc_pages_slowpath() with
> ALLOC_NOFRAGMENT still set, and fail without ever attempting a fallback.
> 
> This causes a large number of SLUB allocation failures for
> skbuff_head_cache under network-heavy workloads, despite free memory
> being available in other migratetype freelists.

That sounds painful.

> Clear ALLOC_NOFRAGMENT and retry for allocations that request kswapd
> reclaim but cannot do direct reclaim themselves (GFP_ATOMIC).  Purely
> speculative allocations like GFP_TRANSHUGE_LIGHT that don't set
> __GFP_KSWAPD_RECLAIM are left to fail, since they have reasonable
> fallbacks and should not cause fragmentation.

How serious is this to our users when running real-world workloads?

> Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode")
> 
> Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>



* Re: [PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations
  2026-05-21 23:59 ` Andrew Morton
@ 2026-05-22 13:05   ` Dmitry Ilvokhin
  2026-05-23  2:54     ` Andrew Morton
  2026-05-26 13:13     ` Vlastimil Babka (SUSE)
  0 siblings, 2 replies; 9+ messages in thread
From: Dmitry Ilvokhin @ 2026-05-22 13:05 UTC
  To: Andrew Morton
  Cc: Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, linux-mm, linux-kernel,
	kernel-team

On Thu, May 21, 2026 at 04:59:10PM -0700, Andrew Morton wrote:
> On Wed, 20 May 2026 12:22:28 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> 
> > When defrag_mode is enabled, ALLOC_NOFRAGMENT is enforced to prevent
> > migratetype fallbacks and keep pageblocks clean. The allocator relies on
> > reclaim and compaction to free pages of the correct type before allowing
> > fallback as a last resort.
> > 
> > However, non-reclaimable allocations such as GFP_ATOMIC cannot invoke
> > direct reclaim or compaction. With defrag_mode=1, these allocations hit
> > the !can_direct_reclaim bailout in __alloc_pages_slowpath() with
> > ALLOC_NOFRAGMENT still set, and fail without ever attempting a fallback.
> > 
> > This causes a large number of SLUB allocation failures for
> > skbuff_head_cache under network-heavy workloads, despite free memory
> > being available in other migratetype freelists.
> 
> That sounds painful.
> 
> > Clear ALLOC_NOFRAGMENT and retry for allocations that request kswapd
> > reclaim but cannot do direct reclaim themselves (GFP_ATOMIC).  Purely
> > speculative allocations like GFP_TRANSHUGE_LIGHT that don't set
> > __GFP_KSWAPD_RECLAIM are left to fail, since they have reasonable
> > fallbacks and should not cause fragmentation.
> 
> How serious is this to our users when running real-world workloads?

We observed it on a few of the Meta workloads that adopted
defrag_mode=1.

For the service under load there were 85509 SLUB allocation failure
messages in dmesg within 2 hours. All of them were GFP_ATOMIC allocations
for skbuff_head_cache, despite free pages being available in other
migratetype freelists (~13 GB free).

Since this is the networking path, in practical terms this means
dropped packets, failed RPC requests, tail latency spikes and overall
service degradation.

> 
> > Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode")
> > 
> > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 


* Re: [PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations
  2026-05-22 13:05   ` Dmitry Ilvokhin
@ 2026-05-23  2:54     ` Andrew Morton
  2026-05-23 13:50       ` Dmitry Ilvokhin
  2026-05-26 13:13     ` Vlastimil Babka (SUSE)
  1 sibling, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2026-05-23  2:54 UTC
  To: Dmitry Ilvokhin
  Cc: Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, linux-mm, linux-kernel,
	kernel-team

On Fri, 22 May 2026 13:05:36 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:

> > How serious is this to our users when running real-world workloads?
> 
> We observed it on a few of the Meta workloads that adopted
> defrag_mode=1.
> 
> For the service under load there were 85509 SLUB allocation failure
> messages in dmesg within 2 hours. All of them were GFP_ATOMIC allocations
> for skbuff_head_cache, despite free pages being available in other
> migratetype freelists (~13 GB free).

For a single machine, I assume.

> Since this is the networking path, in practical terms this means
> dropped packets, failed RPC requests, tail latency spikes and overall
> service degradation.

OK, thanks.  I assume 12 failures per second isn't a disaster, and that
there's no need to fast-track this into 7.1?
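
For reference, the figure of roughly 12 per second is the average of the
reported 85509 dmesg messages over the 2-hour window. A minimal sketch of that
arithmetic follows; the function name is hypothetical, and it averages
reported messages, which is not necessarily the same as every failed
allocation.

```c
#include <assert.h>

/* Average number of events per second over a window given in hours. */
static double failures_per_second(unsigned long failures, double hours)
{
	assert(hours > 0.0);
	return (double)failures / (hours * 3600.0);
}

/* 85509 messages over 2 hours (7200 seconds) averages to about 11.9/s. */
double reported_rate(void)
{
	return failures_per_second(85509UL, 2.0);
}
```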


* Re: [PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations
  2026-05-23  2:54     ` Andrew Morton
@ 2026-05-23 13:50       ` Dmitry Ilvokhin
  0 siblings, 0 replies; 9+ messages in thread
From: Dmitry Ilvokhin @ 2026-05-23 13:50 UTC
  To: Andrew Morton
  Cc: Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, linux-mm, linux-kernel,
	kernel-team

On Fri, May 22, 2026 at 07:54:26PM -0700, Andrew Morton wrote:
> On Fri, 22 May 2026 13:05:36 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> 
> > > How serious is this to our users when running real-world workloads?
> > 
> > We observed it on a few of the Meta workloads that adopted
> > defrag_mode=1.
> > 
> > For the service under load there were 85509 SLUB allocation failure
> > messages in dmesg within 2 hours. All of them were GFP_ATOMIC allocations
> > for skbuff_head_cache, despite free pages being available in other
> > migratetype freelists (~13 GB free).
> 
> For a single machine, I assume.

Yes, all of that data is from a single machine.

> 
> > Since this is the networking path, in practical terms this means
> > dropped packets, failed RPC requests, tail latency spikes and overall
> > service degradation.
> 
> OK, thanks.  I assume 12 failures per second isn't a disaster, and that
> there's no need to fast-track this into 7.1?

Yes, I agree. No need to fast-track this.


* Re: [PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations
  2026-05-22 13:05   ` Dmitry Ilvokhin
  2026-05-23  2:54     ` Andrew Morton
@ 2026-05-26 13:13     ` Vlastimil Babka (SUSE)
  2026-05-26 17:51       ` Johannes Weiner
  1 sibling, 1 reply; 9+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-05-26 13:13 UTC
  To: Dmitry Ilvokhin, Andrew Morton
  Cc: Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, linux-mm, linux-kernel, kernel-team

On 5/22/26 3:05 PM, Dmitry Ilvokhin wrote:
> On Thu, May 21, 2026 at 04:59:10PM -0700, Andrew Morton wrote:
>> On Wed, 20 May 2026 12:22:28 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
>>
>>> When defrag_mode is enabled, ALLOC_NOFRAGMENT is enforced to prevent
>>> migratetype fallbacks and keep pageblocks clean. The allocator relies on
>>> reclaim and compaction to free pages of the correct type before allowing
>>> fallback as a last resort.
>>>
>>> However, non-reclaimable allocations such as GFP_ATOMIC cannot invoke
>>> direct reclaim or compaction. With defrag_mode=1, these allocations hit
>>> the !can_direct_reclaim bailout in __alloc_pages_slowpath() with
>>> ALLOC_NOFRAGMENT still set, and fail without ever attempting a fallback.
>>>
>>> This causes a large number of SLUB allocation failures for
>>> skbuff_head_cache under network-heavy workloads, despite free memory
>>> being available in other migratetype freelists.
>>
>> That sounds painful.
>>
>>> Clear ALLOC_NOFRAGMENT and retry for allocations that request kswapd
>>> reclaim but cannot do direct reclaim themselves (GFP_ATOMIC).  Purely
>>> speculative allocations like GFP_TRANSHUGE_LIGHT that don't set
>>> __GFP_KSWAPD_RECLAIM are left to fail, since they have reasonable
>>> fallbacks and should not cause fragmentation.
>>
>> How serious is this to our users when running real-world workloads?
> 
> We observed it on a few of the Meta workloads that adopted
> defrag_mode=1.

Do you (or Johannes) have some observations to share about what
motivated those to adopt it, what kind of workloads benefit and how?
Because I have no idea who uses this mode and what the expectations are.

Thanks,
Vlastimil



* Re: [PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations
  2026-05-20 12:22 [PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations Dmitry Ilvokhin
  2026-05-21 23:59 ` Andrew Morton
@ 2026-05-26 13:21 ` Vlastimil Babka (SUSE)
  1 sibling, 0 replies; 9+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-05-26 13:21 UTC
  To: Dmitry Ilvokhin, Andrew Morton, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan
  Cc: linux-mm, linux-kernel, kernel-team

On 5/20/26 2:22 PM, Dmitry Ilvokhin wrote:
> When defrag_mode is enabled, ALLOC_NOFRAGMENT is enforced to prevent
> migratetype fallbacks and keep pageblocks clean. The allocator relies on
> reclaim and compaction to free pages of the correct type before allowing
> fallback as a last resort.
> 
> However, non-reclaimable allocations such as GFP_ATOMIC cannot invoke
> direct reclaim or compaction. With defrag_mode=1, these allocations hit
> the !can_direct_reclaim bailout in __alloc_pages_slowpath() with
> ALLOC_NOFRAGMENT still set, and fail without ever attempting a fallback.
> 
> This causes a large number of SLUB allocation failures for
> skbuff_head_cache under network-heavy workloads, despite free memory
> being available in other migratetype freelists.
> 
> Clear ALLOC_NOFRAGMENT and retry for allocations that request kswapd
> reclaim but cannot do direct reclaim themselves (GFP_ATOMIC).  Purely
> speculative allocations like GFP_TRANSHUGE_LIGHT that don't set
> __GFP_KSWAPD_RECLAIM are left to fail, since they have reasonable
> fallbacks and should not cause fragmentation.
> 
> Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode")
> 
> Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>


* Re: [PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations
  2026-05-26 13:13     ` Vlastimil Babka (SUSE)
@ 2026-05-26 17:51       ` Johannes Weiner
  2026-05-27  7:10         ` Vlastimil Babka (SUSE)
  0 siblings, 1 reply; 9+ messages in thread
From: Johannes Weiner @ 2026-05-26 17:51 UTC
  To: Vlastimil Babka (SUSE)
  Cc: Dmitry Ilvokhin, Andrew Morton, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Zi Yan, linux-mm, linux-kernel, kernel-team

On Tue, May 26, 2026 at 03:13:09PM +0200, Vlastimil Babka (SUSE) wrote:
> On 5/22/26 3:05 PM, Dmitry Ilvokhin wrote:
> > On Thu, May 21, 2026 at 04:59:10PM -0700, Andrew Morton wrote:
> >> On Wed, 20 May 2026 12:22:28 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> >>
> >>> When defrag_mode is enabled, ALLOC_NOFRAGMENT is enforced to prevent
> >>> migratetype fallbacks and keep pageblocks clean. The allocator relies on
> >>> reclaim and compaction to free pages of the correct type before allowing
> >>> fallback as a last resort.
> >>>
> >>> However, non-reclaimable allocations such as GFP_ATOMIC cannot invoke
> >>> direct reclaim or compaction. With defrag_mode=1, these allocations hit
> >>> the !can_direct_reclaim bailout in __alloc_pages_slowpath() with
> >>> ALLOC_NOFRAGMENT still set, and fail without ever attempting a fallback.
> >>>
> >>> This causes a large number of SLUB allocation failures for
> >>> skbuff_head_cache under network-heavy workloads, despite free memory
> >>> being available in other migratetype freelists.
> >>
> >> That sounds painful.
> >>
> >>> Clear ALLOC_NOFRAGMENT and retry for allocations that request kswapd
> >>> reclaim but cannot do direct reclaim themselves (GFP_ATOMIC).  Purely
> >>> speculative allocations like GFP_TRANSHUGE_LIGHT that don't set
> >>> __GFP_KSWAPD_RECLAIM are left to fail, since they have reasonable
> >>> fallbacks and should not cause fragmentation.
> >>
> >> How serious is this to our users when running real-world workloads?
> > 
> > We observed it on a few of the Meta workloads that adopted
> > defrag_mode=1.
> 
> Do you (or Johannes) have some observations to share about what
> motivated those to adopt it, what kind of workloads benefit and how?

As you may remember it was developed to help with higher order / THP
success rates under pressure.

The impetus for actually deploying it was that we saw issues with
avalanches of large page cache folios vacuuming up the higher-order
chunks; this (ironically) also led to failures on the network side.

It's kind of a structural problem. We have real preproduction buffers
for order-0 pages through the watermarks. But for higher orders we
only ensure there is at least one page. That easily fails under even
mild competition.

Since we wanted to roll defrag_mode for THP in multi-tenant systems
anyway, we figured we might as well take the plunge now and battle
test the feature this way.

defrag_mode fixes *that* issue, by preproducing watermark buffers in
contiguous pageblocks - making everything up to that order more
readily available. I'm still hoping to make it the default eventually,
which was the plan with the original huge page allocator series. As we
keep leaning into higher order requests more and more, and especially
grow the non-optional ones, we kind of need non-optional preproduction
guarantees for higher orders as well.

But there are bugs like this one, and we're still figuring out some
overreclaim issues with it in production as well. So I'm glad it's
optional for the time being ;-)
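
The structural problem described above (a real buffer for order-0 pages, but
only "at least one page" for higher orders) can be sketched with a toy
watermark check. This is an illustrative model only, loosely following the
shape of the kernel's __zone_watermark_ok(): struct wm_zone, wm_free_pages()
and wm_ok() are hypothetical names, WM_NR_ORDERS is an assumed constant, and
lowmem reserves, migratetypes and allocation flags are all left out.

```c
#include <assert.h>
#include <stdbool.h>

#define WM_NR_ORDERS 11	/* orders 0..10, assumed for illustration */

/* Toy zone: number of free blocks on each order's free list. */
struct wm_zone {
	unsigned long nr_free[WM_NR_ORDERS];
};

/* Total free memory in base pages, summed over all orders. */
static unsigned long wm_free_pages(const struct wm_zone *z)
{
	unsigned long total = 0;

	for (unsigned int o = 0; o < WM_NR_ORDERS; o++)
		total += z->nr_free[o] << o;
	return total;
}

/*
 * Order-0 requests pass as soon as the page count clears the mark.
 * Higher-order requests additionally need one free block of at least
 * the requested order; the mark does not reserve more than that.
 */
static bool wm_ok(const struct wm_zone *z, unsigned int order,
		  unsigned long mark)
{
	if (wm_free_pages(z) <= mark)
		return false;
	if (order == 0)
		return true;
	for (unsigned int o = order; o < WM_NR_ORDERS; o++) {
		if (z->nr_free[o] > 0)
			return true;
	}
	return false;
}

/* One competing order-3 allocation is enough to flip the result. */
void wm_self_check(void)
{
	struct wm_zone z = { .nr_free = { [0] = 10000, [3] = 1 } };

	assert(wm_ok(&z, 3, 1000));	/* a single order-3 block exists */
	z.nr_free[3] = 0;		/* a competitor takes it */
	assert(!wm_ok(&z, 3, 1000));	/* order-3 now fails ... */
	assert(wm_ok(&z, 0, 1000));	/* ... while order-0 is still fine */
}
```

In this model the order-3 check fails with 10000 free base pages, well above
the mark, which is the gap that preproducing buffers in contiguous pageblocks
is meant to close.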

* Re: [PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations
  2026-05-26 17:51       ` Johannes Weiner
@ 2026-05-27  7:10         ` Vlastimil Babka (SUSE)
  0 siblings, 0 replies; 9+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-05-27  7:10 UTC
  To: Johannes Weiner
  Cc: Dmitry Ilvokhin, Andrew Morton, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Zi Yan, linux-mm, linux-kernel, kernel-team

On 5/26/26 19:51, Johannes Weiner wrote:
> On Tue, May 26, 2026 at 03:13:09PM +0200, Vlastimil Babka (SUSE) wrote:
> 
> As you may remember it was developed to help with higher order / THP
> success rates under pressure.
> 
> The impetus for actually deploying it was that we saw issues with
> avalanches of large page cache folios vacuuming up the higher-order
> chunks; this (ironically) also led to failures on the network side.
> 
> It's kind of a structural problem. We have real preproduction buffers
> for order-0 pages through the watermarks. But for higher orders we
> only ensure there is at least one page. That easily fails under even
> mild competition.
> 
> Since we wanted to roll defrag_mode for THP in multi-tenant systems
> anyway, we figured we might as well take the plunge now and battle
> test the feature this way.

Great!

> defrag_mode fixes *that* issue, by preproducing watermark buffers in
> contiguous pageblocks - making everything up to that order more
> readily available. I'm still hoping to make it the default eventually,
> which was the plan with the original huge page allocator series. As we
> keep leaning into higher order requests more and more, and especially
> grow the non-optional ones, we kind of need non-optional preproduction
> guarantees for higher orders as well.
> 
> But there are bugs like this one, and we're still figuring out some
> overreclaim issues with it in production as well. So I'm glad it's
> optional for the time being ;-)

Right :) thanks for sharing!

