* [PATCH 1/4] mm: page_alloc: __GFP_FS lockdep annotation for direct compaction
2026-06-26 18:21 [PATCH 0/4] mm: fix reclaim storms in defrag_mode Johannes Weiner
@ 2026-06-26 18:21 ` Johannes Weiner
2026-06-26 18:21 ` [PATCH 2/4] mm: compaction: support non-movable compaction for pageblock requests Johannes Weiner
` (2 subsequent siblings)
3 siblings, 0 replies; 7+ messages in thread
From: Johannes Weiner @ 2026-06-26 18:21 UTC (permalink / raw)
To: Andrew Morton, Vlastimil Babka
Cc: Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Zi Yan,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, linux-mm, linux-kernel
A subsequent patch will have some order-0 allocations participate in
compaction under defrag_mode, to stave off extfrag events.
Since this is a sprawling expansion of entry points, and compaction
can enter filesystem paths, add a lockdep annotation that catches
__GFP_FS passing errors.
Direct reclaim has had this annotation for a while, and since reclaim
and compaction are usually used in conjunction, this is unlikely to
unearth old bugs. It's more about future proofing and peace of mind.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/page_alloc.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee902a468c2f..cb422505c6ef 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4152,12 +4152,14 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
psi_memstall_enter(&pflags);
delayacct_compact_start();
+ fs_reclaim_acquire(gfp_mask);
noreclaim_flag = memalloc_noreclaim_save();
*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
prio, &page);
memalloc_noreclaim_restore(noreclaim_flag);
+ fs_reclaim_release(gfp_mask);
psi_memstall_leave(&pflags);
delayacct_compact_end();
--
2.54.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH 2/4] mm: compaction: support non-movable compaction for pageblock requests
2026-06-26 18:21 [PATCH 0/4] mm: fix reclaim storms in defrag_mode Johannes Weiner
2026-06-26 18:21 ` [PATCH 1/4] mm: page_alloc: __GFP_FS lockdep annotation for direct compaction Johannes Weiner
@ 2026-06-26 18:21 ` Johannes Weiner
2026-06-26 18:21 ` [PATCH 3/4] mm: page_alloc: move capture_control to the page allocator Johannes Weiner
2026-06-26 18:21 ` [PATCH 4/4] mm: page_alloc: fix non-movable reclaim storm in defrag_mode Johannes Weiner
3 siblings, 0 replies; 7+ messages in thread
From: Johannes Weiner @ 2026-06-26 18:21 UTC (permalink / raw)
To: Andrew Morton, Vlastimil Babka
Cc: Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Zi Yan,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, linux-mm, linux-kernel
While trying to fix a reclaim storm in defrag_mode, I noticed that
non-movable direct compaction is extremely inefficient.
When searching for space to evacuate, compaction only allows blocks of
the same type as the incoming request. This is to prevent migratetype
pollution, where a small non-movable request frees space in a movable
block and provokes the allocator to fall back and pollute it.
This protection is reasonable on one hand, but the downside is that it
makes non-movable direct compaction nearly useless: if we get the type
annotations right, by definition there aren't any movable pages inside
the non-movable blocks it is allowed to scan.
With defrag_mode, the goal is the production of whole blocks, which
are essentially type neutral: __rmqueue_claim() will convert them
wholesale on alloc. This makes type mixing and pollution a non-issue.
Fix the pollution gates to take the requested order into account, and
allow whole-block requests to scan blocks of other types.
The only exception is CMA blocks. That type is sticky and these blocks
cannot be claimed to other types. Continue to be strict with them, and
allow only explicit ALLOC_CMA requests and kcompactd to evacuate them.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/compaction.c | 35 ++++++++++++++++++++++++++++-------
1 file changed, 28 insertions(+), 7 deletions(-)
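Illustration only, not part of the patch: the gating order described in
the changelog can be modelled in userspace roughly as below. This is a
sketch under stated assumptions. The gate_* and GATE_* names are
hypothetical stand-ins for the kernel types, the pageblock order of 9 is
an assumed value (it depends on the configuration), and the
pageblock_skip_persistent() check is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel enums; values are illustrative only. */
enum gate_mt { GATE_MT_UNMOVABLE, GATE_MT_MOVABLE, GATE_MT_RECLAIMABLE, GATE_MT_CMA };
enum gate_mode { GATE_MODE_ASYNC, GATE_MODE_SYNC_LIGHT, GATE_MODE_SYNC };

#define GATE_PAGEBLOCK_ORDER 9	/* assumption: config dependent in the kernel */

/* Reduced model of the compact_control fields the gate looks at. */
struct gate_cc {
	bool direct_compaction;
	bool alloc_cma;		/* models cc->alloc_flags & ALLOC_CMA */
	enum gate_mode mode;
	enum gate_mt migratetype;
	int order;
};

static bool gate_is_cma(enum gate_mt mt)
{
	return mt == GATE_MT_CMA;
}

/* Mirrors is_migrate_movable(): CMA blocks count as movable. */
static bool gate_is_movable(enum gate_mt mt)
{
	return mt == GATE_MT_MOVABLE || gate_is_cma(mt);
}

/* Models the decision order of suitable_migration_source() after this patch. */
static bool gate_suitable_source(const struct gate_cc *cc, enum gate_mt block_mt)
{
	assert(cc->order >= 0);

	/* Background compaction: all block types, including CMA. */
	if (!cc->direct_compaction)
		return true;

	/* CMA blocks are only worth vacating for ALLOC_CMA requests. */
	if (gate_is_cma(block_mt) && !cc->alloc_cma)
		return false;

	if (cc->mode != GATE_MODE_ASYNC)
		return true;

	/* Movable and whole-block requests may evacuate movable blocks. */
	if (cc->migratetype == GATE_MT_MOVABLE || cc->order >= GATE_PAGEBLOCK_ORDER)
		return gate_is_movable(block_mt);

	/* Small non-movable requests stay within their own block type. */
	return block_mt == cc->migratetype;
}
```

In this model, an async order-2 unmovable request is refused a movable
block, while the same request at the assumed pageblock order is allowed
to evacuate it. A CMA block stays off limits unless alloc_cma is set.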
diff --git a/mm/compaction.c b/mm/compaction.c
index f08765ade014..7df3a85d43af 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1381,12 +1381,33 @@ static bool suitable_migration_source(struct compact_control *cc,
if (pageblock_skip_persistent(page))
return false;
- if ((cc->mode != MIGRATE_ASYNC) || !cc->direct_compaction)
+ /*
+ * Background compaction produces blocks for the zone at
+ * large, with no particular allocation context. Allow all
+ * block types, including CMA.
+ */
+ if (!cc->direct_compaction)
return true;
block_mt = get_pageblock_migratetype(page);
- if (cc->migratetype == MIGRATE_MOVABLE)
+ /*
+ * CMA pages can only be taken by ALLOC_CMA requests. For anybody
+ * else, vacating a CMA block consumes free pages the caller
+ * could have used, and produces free pages it cannot.
+ */
+ if (is_migrate_cma(block_mt) && !(cc->alloc_flags & ALLOC_CMA))
+ return false;
+
+ if (cc->mode != MIGRATE_ASYNC)
+ return true;
+
+ /*
+ * Prevent small unmovable/reclaimable requests from polluting
+ * movable blocks through fallbacks. Whole-block production is
+ * exempt as the allocator claims and converts these.
+ */
+ if (cc->migratetype == MIGRATE_MOVABLE || cc->order >= pageblock_order)
return is_migrate_movable(block_mt);
else
return block_mt == cc->migratetype;
@@ -1974,12 +1995,12 @@ static unsigned long fast_find_migrateblock(struct compact_control *cc)
return pfn;
/*
- * Only allow kcompactd and direct requests for movable pages to
- * quickly clear out a MOVABLE pageblock for allocation. This
- * reduces the risk that a large movable pageblock is freed for
- * an unmovable/reclaimable small allocation.
+ * Prevent small unmovable/reclaimable requests from polluting
+ * movable blocks through fallbacks. Whole-block production is
+ * exempt as the allocator claims and converts these.
*/
- if (cc->direct_compaction && cc->migratetype != MIGRATE_MOVABLE)
+ if (cc->direct_compaction && cc->migratetype != MIGRATE_MOVABLE &&
+ cc->order < pageblock_order)
return pfn;
/*
--
2.54.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH 3/4] mm: page_alloc: move capture_control to the page allocator
2026-06-26 18:21 [PATCH 0/4] mm: fix reclaim storms in defrag_mode Johannes Weiner
2026-06-26 18:21 ` [PATCH 1/4] mm: page_alloc: __GFP_FS lockdep annotation for direct compaction Johannes Weiner
2026-06-26 18:21 ` [PATCH 2/4] mm: compaction: support non-movable compaction for pageblock requests Johannes Weiner
@ 2026-06-26 18:21 ` Johannes Weiner
2026-06-26 18:21 ` [PATCH 4/4] mm: page_alloc: fix non-movable reclaim storm in defrag_mode Johannes Weiner
3 siblings, 0 replies; 7+ messages in thread
From: Johannes Weiner @ 2026-06-26 18:21 UTC (permalink / raw)
To: Andrew Morton, Vlastimil Babka
Cc: Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Zi Yan,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, linux-mm, linux-kernel
The compaction capturing code assumes the allocation request order
and compaction target order are the same. That won't be true once
defrag_mode promotes sub-block allocations to pageblock-order
compaction: compaction targets the larger order, while capture should
remain at the original allocation order.
Move the per-task capture_control to the page allocator, so its
fields can carry alloc-side information that compaction's
compact_control does not. Pass the capture_control through
try_to_compact_pages() / compact_zone_order() instead of a bare
struct page **; compact_zone_order() sets capc->cc while running.
task_capc() now also checks capc->cc to handle the new
not-yet-running state.
No functional change.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
include/linux/compaction.h | 3 ++-
mm/compaction.c | 33 ++++++++++-----------------------
mm/page_alloc.c | 23 +++++++++++++++++++++--
3 files changed, 33 insertions(+), 26 deletions(-)
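Illustration only, not part of the patch: after the move, the per-task
capture control has three states (not published, published but
compaction not yet running, compaction running). A minimal userspace
sketch of that lifecycle follows. All capm_* names are hypothetical, the
zone is reduced to an integer id, and the PF_KTHREAD check and the
barrier()/WRITE_ONCE() ordering of the real code are not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced models of compact_control and capture_control. */
struct capm_compact_control {
	int zone;			/* stands in for cc->zone */
};

struct capm_capture_control {
	struct capm_compact_control *cc;	/* non-NULL only while compacting */
	void *page;				/* captured page, if any */
};

/* Models current->capture_control for a single task. */
static struct capm_capture_control *capm_current;

/* Models task_capc(): capture is only armed while compaction runs. */
static struct capm_capture_control *capm_task_capc(int zone)
{
	struct capm_capture_control *capc = capm_current;

	if (!capc || !capc->cc)		/* unpublished, or not yet running */
		return NULL;
	if (capc->page)			/* a page was already captured */
		return NULL;
	return capc->cc->zone == zone ? capc : NULL;
}

/* Models compact_zone_order() attaching its on-stack cc. */
static void capm_compaction_begin(struct capm_capture_control *capc,
				  struct capm_compact_control *cc)
{
	assert(capc->cc == NULL);	/* must not already be running */
	capc->cc = cc;
}

/* Models compact_zone_order() detaching cc once compact_zone() returns. */
static void capm_compaction_end(struct capm_capture_control *capc)
{
	capc->cc = NULL;
}
```

The point of the extra capc->cc test is visible in the middle state: the
allocator has published the structure, but a page freed at that moment
must not be captured because no compact_control is attached yet.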
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index f29ef0653546..66a2f70e9e01 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -58,6 +58,7 @@ enum compact_result {
};
struct alloc_context; /* in mm/internal.h */
+struct capture_control; /* in mm/internal.h */
/*
* Number of free order-0 pages that should be available above given watermark
@@ -92,7 +93,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
unsigned int order, unsigned int alloc_flags,
const struct alloc_context *ac, enum compact_priority prio,
- struct page **page);
+ struct capture_control *capc);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern bool compaction_suitable(struct zone *zone, int order,
unsigned long watermark, int highest_zoneidx);
diff --git a/mm/compaction.c b/mm/compaction.c
index 7df3a85d43af..c2701bf1d04e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2791,7 +2791,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
static enum compact_result compact_zone_order(struct zone *zone, int order,
gfp_t gfp_mask, enum compact_priority prio,
unsigned int alloc_flags, int highest_zoneidx,
- struct page **capture)
+ struct capture_control *capc)
{
enum compact_result ret;
struct compact_control cc = {
@@ -2808,35 +2808,22 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
};
- struct capture_control capc = {
- .cc = &cc,
- .page = NULL,
- };
- /*
- * Make sure the structs are really initialized before we expose the
- * capture control, in case we are interrupted and the interrupt handler
- * frees a page.
- */
+ /* See the comment in __alloc_pages_direct_compact() */
barrier();
- WRITE_ONCE(current->capture_control, &capc);
+ WRITE_ONCE(capc->cc, &cc);
- ret = compact_zone(&cc, &capc);
+ ret = compact_zone(&cc, capc);
+
+ WRITE_ONCE(capc->cc, NULL);
- /*
- * Make sure we hide capture control first before we read the captured
- * page pointer, otherwise an interrupt could free and capture a page
- * and we would leak it.
- */
- WRITE_ONCE(current->capture_control, NULL);
- *capture = READ_ONCE(capc.page);
/*
* Technically, it is also possible that compaction is skipped but
* the page is still captured out of luck(IRQ came and freed the page).
* Returning COMPACT_SUCCESS in such cases helps in properly accounting
* the COMPACT[STALL|FAIL] when compaction is skipped.
*/
- if (*capture)
+ if (capc->page)
ret = COMPACT_SUCCESS;
return ret;
@@ -2849,13 +2836,13 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
* @alloc_flags: The allocation flags of the current allocation
* @ac: The context of current allocation
* @prio: Determines how hard direct compaction should try to succeed
- * @capture: Pointer to free page created by compaction will be stored here
+ * @capc: The context for capturing pages during freeing
*
* This is the main entry point for direct page compaction.
*/
enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
unsigned int alloc_flags, const struct alloc_context *ac,
- enum compact_priority prio, struct page **capture)
+ enum compact_priority prio, struct capture_control *capc)
{
struct zoneref *z;
struct zone *zone;
@@ -2883,7 +2870,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
}
status = compact_zone_order(zone, order, gfp_mask, prio,
- alloc_flags, ac->highest_zoneidx, capture);
+ alloc_flags, ac->highest_zoneidx, capc);
rc = max(status, rc);
/* The allocation should succeed, stop compacting */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cb422505c6ef..9dee1c47e795 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -718,7 +718,7 @@ static inline struct capture_control *task_capc(struct zone *zone)
{
struct capture_control *capc = current->capture_control;
- return unlikely(capc) &&
+ return unlikely(capc && capc->cc) &&
!(current->flags & PF_KTHREAD) &&
!capc->page &&
capc->cc->zone == zone ? capc : NULL;
@@ -4146,23 +4146,42 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct page *page = NULL;
unsigned long pflags;
unsigned int noreclaim_flag;
+ struct capture_control capc = {
+ .page = NULL,
+ };
if (!order)
return NULL;
+ /*
+ * Make sure the structs are really initialized before we expose the
+ * capture control, in case we are interrupted and the interrupt handler
+ * frees a page.
+ */
+ barrier();
+ WRITE_ONCE(current->capture_control, &capc);
+
psi_memstall_enter(&pflags);
delayacct_compact_start();
fs_reclaim_acquire(gfp_mask);
noreclaim_flag = memalloc_noreclaim_save();
*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
- prio, &page);
+ prio, &capc);
memalloc_noreclaim_restore(noreclaim_flag);
fs_reclaim_release(gfp_mask);
psi_memstall_leave(&pflags);
delayacct_compact_end();
+ /*
+ * Make sure we hide capture control first before we read the captured
+ * page pointer, otherwise an interrupt could free and capture a page
+ * and we would leak it.
+ */
+ WRITE_ONCE(current->capture_control, NULL);
+ page = READ_ONCE(capc.page);
+
if (*compact_result == COMPACT_SKIPPED ||
*compact_result == COMPACT_DEFERRED)
return NULL;
--
2.54.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH 4/4] mm: page_alloc: fix non-movable reclaim storm in defrag_mode
2026-06-26 18:21 [PATCH 0/4] mm: fix reclaim storms in defrag_mode Johannes Weiner
` (2 preceding siblings ...)
2026-06-26 18:21 ` [PATCH 3/4] mm: page_alloc: move capture_control to the page allocator Johannes Weiner
@ 2026-06-26 18:21 ` Johannes Weiner
2026-06-26 18:29 ` Zi Yan
3 siblings, 1 reply; 7+ messages in thread
From: Johannes Weiner @ 2026-06-26 18:21 UTC (permalink / raw)
To: Andrew Morton, Vlastimil Babka
Cc: Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Zi Yan,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, linux-mm, linux-kernel
As we deployed defrag_mode into Meta production, pressure spikes and
excessive swapping were observed on some workloads. Tracing confirmed
that this is unmovable/reclaimable requests spinning in the allocator
and direct reclaim, causing excessive amounts of swap.
The initial plan for defrag_mode was to rely on kswapd/kcompactd to
produce blocks, and if those are overwhelmed under high pressure, let
the allocator fall back (__rmqueue_steal()) after its retry loops.
However, that retrying results in more reclaim on some of these
workloads than we'd hoped, sometimes excessively so, spurred on by the
!costly order conditions in should_reclaim_retry().
The storms are dependent on the request type. Reclaim will inevitably
make room in existing movable blocks, since that's where the LRU pages
live. So if movable requests retry on reclaim, they make progress.
When non-movable requests spin in reclaim, that isn't productive. They
cannot use the individually freed pages, and the process is unlikely
to accidentally free whole blocks to meet the ALLOC_NOFRAGMENT bar.
They spin and overreclaim excessively, which tanks performance and
triggers userspace guards like swap exhaustion or pressure based OOM.
To fix this, send non-movable requests, regardless of order, into
pageblock reclaim/compaction. This way, they help move things along to
meet the ALLOC_NOFRAGMENT bar. After this patch, the reclaim storms
and excess OOM rates are no longer observed in production.
The longer-term plan is still to have all requests, including the
movable ones, help make blocks to spread the cost of defragmenting
more evenly and fairly; combined with proper watermarking to reduce
allocation latencies in the common case. However, doing this naively
unearths scaling and concurrency limitations in compaction that need
to be addressed first. Promoting just non-movables for now is the
minimally viable bug fix for the above issue.
Fixes: f38356df6474 ("mm: page_alloc: introduce defrag_mode")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/internal.h | 7 +++++++
mm/page_alloc.c | 36 +++++++++++++++++++++++++++++-------
2 files changed, 36 insertions(+), 7 deletions(-)
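Illustration only, not part of the patch: the order promotion applied
in both the direct compaction and direct reclaim paths can be sketched
in userspace as below. The promo_* and PROMO_* names are hypothetical,
the nofragment flag stands in for alloc_flags & ALLOC_NOFRAGMENT, and
the pageblock order of 9 is an assumed value that depends on the
configuration.

```c
#include <assert.h>
#include <stdbool.h>

#define PROMO_PAGEBLOCK_ORDER 9	/* assumption: config dependent in the kernel */

/* Hypothetical stand-in for the migratetypes that matter here. */
enum promo_mt { PROMO_MT_UNMOVABLE, PROMO_MT_MOVABLE, PROMO_MT_RECLAIMABLE };

/*
 * Returns the order handed to reclaim/compaction. The caller keeps the
 * original order for capture and for the allocation itself.
 */
static int promo_work_order(int order, bool nofragment, enum promo_mt mt)
{
	assert(order >= 0);

	/* Non-movable requests under defrag_mode work on whole blocks. */
	if (nofragment && mt != PROMO_MT_MOVABLE)
		return order > PROMO_PAGEBLOCK_ORDER ? order : PROMO_PAGEBLOCK_ORDER;

	return order;
}
```

Under these assumptions an order-0 unmovable request is promoted to
order 9, while a movable request, or any request without the
nofragment flag, keeps its own order. This is also why the early
"!order" bail-out in the compaction path has to test the promoted
order: an order-0 non-movable request must now reach compaction.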
diff --git a/mm/internal.h b/mm/internal.h
index 181e79f1d6a2..1f636cfc859a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1060,6 +1060,13 @@ struct compact_control {
*/
struct capture_control {
struct compact_control *cc;
+ /*
+ * Allocation request order. May differ from the compaction
+ * order: defrag_mode promotes sub-block allocations to
+ * pageblock-order compaction; capture still matches at the
+ * original allocation order so prep_new_page() is consistent.
+ */
+ int order;
struct page *page;
};
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9dee1c47e795..575a99a4c723 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -728,7 +728,7 @@ static inline bool
compaction_capture(struct capture_control *capc, struct page *page,
int order, int migratetype)
{
- if (!capc || order != capc->cc->order)
+ if (!capc || order != capc->order)
return false;
/* Do not accidentally pollute CMA or isolated regions*/
@@ -748,7 +748,7 @@ compaction_capture(struct capture_control *capc, struct page *page,
return false;
if (migratetype != capc->cc->migratetype)
- trace_mm_page_alloc_extfrag(page, capc->cc->order, order,
+ trace_mm_page_alloc_extfrag(page, capc->order, order,
capc->cc->migratetype, migratetype);
capc->page = page;
@@ -4147,10 +4147,27 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
unsigned long pflags;
unsigned int noreclaim_flag;
struct capture_control capc = {
+ .order = order,
.page = NULL,
};
+ int compact_order = order;
- if (!order)
+ /*
+ * If fallbacks are not permitted (defrag_mode), we either
+ * need to reclaim space in a block of matching type, or clear
+ * out an entire block to allow __rmqueue_claim() to convert.
+ *
+ * Reclaim by itself is primarily freeing space in movable
+ * blocks, since that's where the LRU pages live. So this
+ * works for movable requests, but not for others.
+ *
+ * For those, promote the order to help make blocks, instead
+ * of spinning in reclaim alone unproductively.
+ */
+ if ((alloc_flags & ALLOC_NOFRAGMENT) && ac->migratetype != MIGRATE_MOVABLE)
+ compact_order = max(order, pageblock_order);
+
+ if (!compact_order)
return NULL;
/*
@@ -4166,8 +4183,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
fs_reclaim_acquire(gfp_mask);
noreclaim_flag = memalloc_noreclaim_save();
- *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
- prio, &capc);
+ *compact_result = try_to_compact_pages(gfp_mask, compact_order,
+ alloc_flags, ac, prio, &capc);
memalloc_noreclaim_restore(noreclaim_flag);
fs_reclaim_release(gfp_mask);
@@ -4203,7 +4220,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct zone *zone = page_zone(page);
zone->compact_blockskip_flush = false;
- compaction_defer_reset(zone, order, true);
+ compaction_defer_reset(zone, compact_order, true);
count_vm_event(COMPACTSUCCESS);
return page;
}
@@ -4443,9 +4460,14 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
struct page *page = NULL;
unsigned long pflags;
bool drained = false;
+ int reclaim_order = order;
+
+ /* Match the slowpath compaction promotion in __alloc_pages_direct_compact */
+ if ((alloc_flags & ALLOC_NOFRAGMENT) && ac->migratetype != MIGRATE_MOVABLE)
+ reclaim_order = max(order, pageblock_order);
psi_memstall_enter(&pflags);
- *did_some_progress = __perform_reclaim(gfp_mask, order, ac);
+ *did_some_progress = __perform_reclaim(gfp_mask, reclaim_order, ac);
if (unlikely(!(*did_some_progress)))
goto out;
--
2.54.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH 4/4] mm: page_alloc: fix non-movable reclaim storm in defrag_mode
2026-06-26 18:21 ` [PATCH 4/4] mm: page_alloc: fix non-movable reclaim storm in defrag_mode Johannes Weiner
@ 2026-06-26 18:29 ` Zi Yan
2026-06-26 18:43 ` Johannes Weiner
0 siblings, 1 reply; 7+ messages in thread
From: Zi Yan @ 2026-06-26 18:29 UTC (permalink / raw)
To: Johannes Weiner, Andrew Morton, Vlastimil Babka
Cc: Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, linux-mm, linux-kernel
On Fri Jun 26, 2026 at 2:21 PM EDT, Johannes Weiner wrote:
> As we deployed defrag_mode into Meta production, pressure spikes and
> excessive swapping were observed on some workloads. Tracing confirmed
> that this is unmovable/reclaimable requests spinning in the allocator
> and direct reclaim, causing excessive amounts of swap.
>
> The initial plan for defrag_mode was to rely on kswapd/kcompactd to
> produce blocks, and if those are overwhelmed under high pressure, let
> the allocator fall back (__rmqueue_steal()) after its retry loops.
> However, that retrying results in more reclaim on some of these
> workloads than we'd hoped, sometimes excessively so, spurred on by the
> !costly order conditions in should_reclaim_retry().
>
> The storms are dependent on the request type. Reclaim will inevitably
> make room in existing movable blocks, since that's where the LRU pages
> live. So if movable requests retry on reclaim, they make progress.
>
> When non-movable requests spin in reclaim that isn't productive. They
> cannot use the individually freed pages, and the process is unlikely
> to accidentally free whole blocks to meet the ALLOC_NOFRAGMENT bar.
> They spin and overreclaim excessively, which tanks performance and
> triggers userspace guards like swap exhaustion or pressure based OOM.
>
> To fix this, send non-movable requests, regardless of order, into
> pageblock reclaim/compaction. This way, they help move things along to
> meet the ALLOC_NOFRAGMENT bar. After this patch, the reclaim storms
> and excess OOM rates are no longer observed in production.
>
> The longer-term plan is still to have all requests, including the
> movable ones, help make blocks to spread the cost of defragmenting
> more evenly and fairly; combined with proper watermarking to reduce
> allocation latencies in the common case. However, doing this naively
> unearths scaling and concurrency limitations in compaction that need
> to be addressed first. Promoting just non-movables for now is the
> minimally viable bug fix for the above issue.
>
> Fixes: f38356df6474 ("mm: page_alloc: introduce defrag_mode")
Should be
Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode").
Since I cannot find f38356df6474 in the tree.
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 4/4] mm: page_alloc: fix non-movable reclaim storm in defrag_mode
2026-06-26 18:29 ` Zi Yan
@ 2026-06-26 18:43 ` Johannes Weiner
0 siblings, 0 replies; 7+ messages in thread
From: Johannes Weiner @ 2026-06-26 18:43 UTC (permalink / raw)
To: Zi Yan
Cc: Andrew Morton, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Mike Rapoport, linux-mm, linux-kernel
On Fri, Jun 26, 2026 at 02:29:24PM -0400, Zi Yan wrote:
> On Fri Jun 26, 2026 at 2:21 PM EDT, Johannes Weiner wrote:
> > As we deployed defrag_mode into Meta production, pressure spikes and
> > excessive swapping were observed on some workloads. Tracing confirmed
> > that this is unmovable/reclaimable requests spinning in the allocator
> > and direct reclaim, causing excessive amounts of swap.
> >
> > The initial plan for defrag_mode was to rely on kswapd/kcompactd to
> > produce blocks, and if those are overwhelmed under high pressure, let
> > the allocator fall back (__rmqueue_steal()) after its retry loops.
> > However, that retrying results in more reclaim on some of these
> > workloads than we'd hoped, sometimes excessively so, spurred on by the
> > !costly order conditions in should_reclaim_retry().
> >
> > The storms are dependent on the request type. Reclaim will inevitably
> > make room in existing movable blocks, since that's where the LRU pages
> > live. So if movable requests retry on reclaim, they make progress.
> >
> > When non-movable requests spin in reclaim that isn't productive. They
> > cannot use the individually freed pages, and the process is unlikely
> > to accidentally free whole blocks to meet the ALLOC_NOFRAGMENT bar.
> > They spin and overreclaim excessively, which tanks performance and
> > triggers userspace guards like swap exhaustion or pressure based OOM.
> >
> > To fix this, send non-movable requests, regardless of order, into
> > pageblock reclaim/compaction. This way, they help move things along to
> > meet the ALLOC_NOFRAGMENT bar. After this patch, the reclaim storms
> > and excess OOM rates are no longer observed in production.
> >
> > The longer-term plan is still to have all requests, including the
> > movable ones, help make blocks to spread the cost of defragmenting
> > more evenly and fairly; combined with proper watermarking to reduce
> > allocation latencies in the common case. However, doing this naively
> > unearths scaling and concurrency limitations in compaction that need
> > to be addressed first. Promoting just non-movables for now is the
> > minimally viable bug fix for the above issue.
> >
> > Fixes: f38356df6474 ("mm: page_alloc: introduce defrag_mode")
>
> Should be
> Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode").
> Since I cannot find f38356df6474 in the tree.
Oops, indeed. I managed to pull that commit from the old development
branch I still had locally.
^ permalink raw reply [flat|nested] 7+ messages in thread