* [PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations
From: Dmitry Ilvokhin @ 2026-05-20 12:22 UTC (permalink / raw)
To: Andrew Morton, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan
Cc: linux-mm, linux-kernel, kernel-team, Dmitry Ilvokhin
When defrag_mode is enabled, ALLOC_NOFRAGMENT is enforced to prevent
migratetype fallbacks and keep pageblocks clean. The allocator relies on
reclaim and compaction to free pages of the correct type before allowing
fallback as a last resort.

However, non-reclaimable allocations such as GFP_ATOMIC cannot invoke
direct reclaim or compaction. With defrag_mode=1, these allocations hit
the !can_direct_reclaim bailout in __alloc_pages_slowpath() with
ALLOC_NOFRAGMENT still set, and fail without ever attempting a fallback.

This causes a large number of SLUB allocation failures for
skbuff_head_cache under network-heavy workloads, despite free memory
being available in other migratetype freelists.

Clear ALLOC_NOFRAGMENT and retry for allocations that request kswapd
reclaim but cannot do direct reclaim themselves (GFP_ATOMIC). Purely
speculative allocations like GFP_TRANSHUGE_LIGHT that don't set
__GFP_KSWAPD_RECLAIM are left to fail, since they have reasonable
fallbacks and should not cause fragmentation.

Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode")

Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
Changes in v2:
- Add check for __GFP_KSWAPD_RECLAIM.
- Picked up Johannes' Acked-by tag.

v1: https://lore.kernel.org/all/20260518163736.173910-1-d@ilvokhin.com/
mm/page_alloc.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 227d58dc3de6..c5a077de1be0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4811,8 +4811,19 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/* Caller is not willing to reclaim, we can't balance anything */
-	if (!can_direct_reclaim)
+	if (!can_direct_reclaim) {
+		/*
+		 * Reclaim/compaction cannot run, so defrag_mode's strategy
+		 * of enforcing ALLOC_NOFRAGMENT cannot be fulfilled. Allow
+		 * fallbacks rather than failing the allocation outright.
+		 */
+		if (defrag_mode && (alloc_flags & ALLOC_NOFRAGMENT) &&
+		    (gfp_mask & __GFP_KSWAPD_RECLAIM)) {
+			alloc_flags &= ~ALLOC_NOFRAGMENT;
+			goto retry;
+		}
 		goto nopage;
+	}
 
 	/* Avoid recursion of direct reclaim */
 	if (current->flags & PF_MEMALLOC)
--
2.53.0-Meta
* Re: [PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations
From: Andrew Morton @ 2026-05-21 23:59 UTC (permalink / raw)
To: Dmitry Ilvokhin
Cc: Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan, linux-mm, linux-kernel,
kernel-team
On Wed, 20 May 2026 12:22:28 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> When defrag_mode is enabled, ALLOC_NOFRAGMENT is enforced to prevent
> migratetype fallbacks and keep pageblocks clean. The allocator relies on
> reclaim and compaction to free pages of the correct type before allowing
> fallback as a last resort.
>
> However, non-reclaimable allocations such as GFP_ATOMIC cannot invoke
> direct reclaim or compaction. With defrag_mode=1, these allocations hit
> the !can_direct_reclaim bailout in __alloc_pages_slowpath() with
> ALLOC_NOFRAGMENT still set, and fail without ever attempting a fallback.
>
> This causes a large number of SLUB allocation failures for
> skbuff_head_cache under network-heavy workloads, despite free memory
> being available in other migratetype freelists.
That sounds painful.
> Clear ALLOC_NOFRAGMENT and retry for allocations that request kswapd
> reclaim but cannot do direct reclaim themselves (GFP_ATOMIC). Purely
> speculative allocations like GFP_TRANSHUGE_LIGHT that don't set
> __GFP_KSWAPD_RECLAIM are left to fail, since they have reasonable
> fallbacks and should not cause fragmentation.
How serious is this to our users when running real-world workloads?
* Re: [PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations
From: Dmitry Ilvokhin @ 2026-05-22 13:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan, linux-mm, linux-kernel,
kernel-team
On Thu, May 21, 2026 at 04:59:10PM -0700, Andrew Morton wrote:
> How serious is this to our users when running real-world workloads?
We observed it on a few of the Meta workloads that adopted
defrag_mode=1.

For the service under load there were 85509 SLUB allocation failure
messages in dmesg within 2 hours. All of them are GFP_ATOMIC allocations
for skbuff_head_cache, despite free pages being available in other
migratetype freelists (~13 GB free).

Since this is the networking path, in practical terms this means
dropped packets, failed RPC requests, tail latency spikes and overall
service degradation.
* Re: [PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations
From: Andrew Morton @ 2026-05-23 2:54 UTC (permalink / raw)
To: Dmitry Ilvokhin
Cc: Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan, linux-mm, linux-kernel,
kernel-team
On Fri, 22 May 2026 13:05:36 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> > How serious is this to our users when running real-world workloads?
>
> We observed it on a few of the Meta workloads that adopted
> defrag_mode=1.
>
> For the service under load there were 85509 SLUB allocation failure
> messages in dmesg within 2 hours. All of them are GFP_ATOMIC allocations
> for skbuff_head_cache, despite free pages being available in other
> migratetype freelists (~13 GB free).
For a single machine, I assume.
> Since this is the networking path, in practical terms this means
> dropped packets, failed RPC requests, tail latency spikes and overall
> service degradation.
OK, thanks. I assume 12 failures per second isn't a disaster, and that
there's no need to fast-track this into 7.1?
* Re: [PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations
From: Dmitry Ilvokhin @ 2026-05-23 13:50 UTC (permalink / raw)
To: Andrew Morton
Cc: Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan, linux-mm, linux-kernel,
kernel-team
On Fri, May 22, 2026 at 07:54:26PM -0700, Andrew Morton wrote:
> On Fri, 22 May 2026 13:05:36 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
>
> > > How serious is this to our users when running real-world workloads?
> >
> > We observed it on a few of the Meta workloads that adopted
> > defrag_mode=1.
> >
> > For the service under load there were 85509 SLUB allocation failure
> > messages in dmesg within 2 hours. All of them are GFP_ATOMIC allocations
> > for skbuff_head_cache, despite free pages being available in other
> > migratetype freelists (~13 GB free).
>
> For a single machine, I assume.
Yes, all of that data is from a single machine.
>
> > Since this is the networking path, in practical terms this means
> > dropped packets, failed RPC requests, tail latency spikes and overall
> > service degradation.
>
> OK, thanks. I assume 12 failures per second isn't a disaster, and that
> there's no need to fast-track this into 7.1?
Yes, I agree. No need to fast-track this.