* [PATCH 0/4] mm: fix reclaim storms in defrag_mode
@ 2026-06-26 18:21 Johannes Weiner
2026-06-26 18:21 ` [PATCH 1/4] mm: page_alloc: __GFP_FS lockdep annotation for direct compaction Johannes Weiner
` (3 more replies)
0 siblings, 4 replies; 16+ messages in thread
From: Johannes Weiner @ 2026-06-26 18:21 UTC (permalink / raw)
To: Andrew Morton, Vlastimil Babka
Cc: Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Zi Yan,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, linux-mm, linux-kernel
As we deployed vm.defrag_mode=1 into Meta production, some workloads
regressed with recurring pressure spikes and swap storms (which in turn
triggered userspace OOM rules on pressure and swap utilization levels).
Tracing pinned this to non-movable requests spinning and reclaiming
unproductively when kswapd/kcompactd are overwhelmed. Direct reclaim
predominantly frees up pages in movable blocks, but those requests
cannot use that space under defrag_mode rules; and it is unlikely to
free up whole blocks incidentally for __rmqueue_claim() to work.
This series fixes it by making non-movable requests participate in
pageblock production in the allocator slowpath.
That requires some small-ish adjustments up front in the allocator and
the compaction code: three prep patches and the fix last.
The series has been in production against one of the affected workloads
for two weeks and restores the OOM kill rate to the !defrag_mode baseline.
Based on mm-new (2026-06-22).
include/linux/compaction.h |  3 +-
mm/compaction.c            | 68 ++++++++++++++++++++++++--------------------
mm/internal.h              |  7 +++++
mm/page_alloc.c            | 59 ++++++++++++++++++++++++++++++------
4 files changed, 98 insertions(+), 39 deletions(-)
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 1/4] mm: page_alloc: __GFP_FS lockdep annotation for direct compaction
2026-06-26 18:21 [PATCH 0/4] mm: fix reclaim storms in defrag_mode Johannes Weiner
@ 2026-06-26 18:21 ` Johannes Weiner
2026-07-01 13:45 ` Vlastimil Babka (SUSE)
2026-06-26 18:21 ` [PATCH 2/4] mm: compaction: support non-movable compaction for pageblock requests Johannes Weiner
` (2 subsequent siblings)
3 siblings, 1 reply; 16+ messages in thread
From: Johannes Weiner @ 2026-06-26 18:21 UTC (permalink / raw)
To: Andrew Morton, Vlastimil Babka
Cc: Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Zi Yan,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, linux-mm, linux-kernel
A subsequent patch will have some order-0 allocations participate in
compaction under defrag_mode, to stave off extfrag events.
Since this is a sprawling expansion of entry points, and compaction
can enter filesystem paths, add a lockdep annotation that catches
__GFP_FS passing errors.
Direct reclaim has had this annotation for a while, and since reclaim
and compaction are usually used in conjunction, this is unlikely to
unearth old bugs. It's more about future proofing and peace of mind.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/page_alloc.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee902a468c2f..cb422505c6ef 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4152,12 +4152,14 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
psi_memstall_enter(&pflags);
delayacct_compact_start();
+ fs_reclaim_acquire(gfp_mask);
noreclaim_flag = memalloc_noreclaim_save();
*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
prio, &page);
memalloc_noreclaim_restore(noreclaim_flag);
+ fs_reclaim_release(gfp_mask);
psi_memstall_leave(&pflags);
delayacct_compact_end();
--
2.54.0
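A rough userspace sketch of the class of error the annotation above is meant to flag. This is a toy model, not real lockdep: every toy_* name and the TOY_GFP_FS value are hypothetical, and a simple counter stands in for lockdep's lock-class dependency tracking.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for __GFP_FS; the value is illustrative only. */
#define TOY_GFP_FS 0x1u

static int toy_fs_locks_held;	/* locks that filesystem code may also take in reclaim */
static int toy_violations;	/* what a lockdep splat would correspond to */

static void toy_fs_lock(void)   { toy_fs_locks_held++; }
static void toy_fs_unlock(void) { assert(toy_fs_locks_held > 0); toy_fs_locks_held--; }

/*
 * Models the annotation: entering a section that may recurse into
 * filesystem code (gfp allows FS) while already holding FS locks
 * is the __GFP_FS passing error we want reported.
 */
static void toy_fs_reclaim_acquire(unsigned int gfp)
{
	if ((gfp & TOY_GFP_FS) && toy_fs_locks_held) {
		toy_violations++;
		fprintf(stderr, "toy lockdep: FS-capable section entered under FS lock\n");
	}
}

static void toy_fs_reclaim_release(unsigned int gfp)
{
	(void)gfp;	/* nothing to undo in this toy model */
}

/* Models the bracketed region in __alloc_pages_direct_compact(). */
static void toy_direct_compact(unsigned int gfp)
{
	toy_fs_reclaim_acquire(gfp);
	/* ... compaction work that may call into filesystem paths ... */
	toy_fs_reclaim_release(gfp);
}
```

In this model, calling toy_direct_compact(TOY_GFP_FS) between toy_fs_lock() and toy_fs_unlock() bumps toy_violations, while the same call with a mask of 0 (a GFP_NOFS-like request) does not.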
* [PATCH 2/4] mm: compaction: support non-movable compaction for pageblock requests
2026-06-26 18:21 [PATCH 0/4] mm: fix reclaim storms in defrag_mode Johannes Weiner
2026-06-26 18:21 ` [PATCH 1/4] mm: page_alloc: __GFP_FS lockdep annotation for direct compaction Johannes Weiner
@ 2026-06-26 18:21 ` Johannes Weiner
2026-07-01 14:19 ` Vlastimil Babka (SUSE)
2026-06-26 18:21 ` [PATCH 3/4] mm: page_alloc: move capture_control to the page allocator Johannes Weiner
2026-06-26 18:21 ` [PATCH 4/4] mm: page_alloc: fix non-movable reclaim storm in defrag_mode Johannes Weiner
3 siblings, 1 reply; 16+ messages in thread
From: Johannes Weiner @ 2026-06-26 18:21 UTC (permalink / raw)
To: Andrew Morton, Vlastimil Babka
Cc: Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Zi Yan,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, linux-mm, linux-kernel
While trying to fix a reclaim storm in defrag_mode, I noticed that
non-movable direct compaction is extremely inefficient.
When searching for space to evacuate, compaction only allows blocks of
the same type as the incoming request. This is to prevent migratetype
pollution, where a small non-movable request frees space in a movable
block and provokes the allocator to fall back and pollute it.
This protection is reasonable on one hand, but the downside is that it
makes non-movable direct compaction nearly useless: if we get the type
annotations right, by definition there aren't any movable pages inside
the non-movable blocks it is allowed to scan.
With defrag_mode, the goal is the production of whole blocks, which
are essentially type neutral: __rmqueue_claim() will convert them
wholesale on alloc. This makes type mixing and pollution a non-issue.
Fix the pollution gates to take the requested order into account, and
allow whole-block requests to scan blocks of other types.
The only exception is CMA blocks. That type is sticky and these blocks
cannot be claimed to other types. Continue to be strict with them, and
allow only explicit ALLOC_CMA requests and kcompactd to evacuate them.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/compaction.c | 35 ++++++++++++++++++++++++++++-------
1 file changed, 28 insertions(+), 7 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index f08765ade014..7df3a85d43af 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1381,12 +1381,33 @@ static bool suitable_migration_source(struct compact_control *cc,
if (pageblock_skip_persistent(page))
return false;
- if ((cc->mode != MIGRATE_ASYNC) || !cc->direct_compaction)
+ /*
+ * Background compaction produces blocks for the zone at
+ * large, with no particular allocation context. Allow all
+ * block types, including CMA.
+ */
+ if (!cc->direct_compaction)
return true;
block_mt = get_pageblock_migratetype(page);
- if (cc->migratetype == MIGRATE_MOVABLE)
+ /*
+ * CMA pages can only be taken by ALLOC_CMA requests. For anybody
+ * else, vacating a CMA block consumes free pages the caller
+ * could have used, and produces free pages it cannot.
+ */
+ if (is_migrate_cma(block_mt) && !(cc->alloc_flags & ALLOC_CMA))
+ return false;
+
+ if (cc->mode != MIGRATE_ASYNC)
+ return true;
+
+ /*
+ * Prevent small unmovable/reclaimable requests from polluting
+ * movable blocks through fallbacks. Whole-block production is
+ * exempt as the allocator claims and converts these.
+ */
+ if (cc->migratetype == MIGRATE_MOVABLE || cc->order >= pageblock_order)
return is_migrate_movable(block_mt);
else
return block_mt == cc->migratetype;
@@ -1974,12 +1995,12 @@ static unsigned long fast_find_migrateblock(struct compact_control *cc)
return pfn;
/*
- * Only allow kcompactd and direct requests for movable pages to
- * quickly clear out a MOVABLE pageblock for allocation. This
- * reduces the risk that a large movable pageblock is freed for
- * an unmovable/reclaimable small allocation.
+ * Prevent small unmovable/reclaimable requests from polluting
+ * movable blocks through fallbacks. Whole-block production is
+ * exempt as the allocator claims and converts these.
*/
- if (cc->direct_compaction && cc->migratetype != MIGRATE_MOVABLE)
+ if (cc->direct_compaction && cc->migratetype != MIGRATE_MOVABLE &&
+ cc->order < pageblock_order)
return pfn;
/*
--
2.54.0
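The gate order in suitable_migration_source() after this patch can be condensed into a small userspace model. It is only a sketch: the toy_* types, the enum values and TOY_PAGEBLOCK_ORDER = 9 are assumptions for illustration, and the pageblock_skip_persistent() check is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel types; values are illustrative only. */
enum toy_mt   { TOY_UNMOVABLE, TOY_MOVABLE, TOY_RECLAIMABLE, TOY_CMA };
enum toy_mode { TOY_ASYNC, TOY_SYNC_LIGHT, TOY_SYNC };

#define TOY_PAGEBLOCK_ORDER 9	/* assumed; configuration dependent in the kernel */

struct toy_cc {
	bool direct_compaction;
	bool alloc_cma;		/* models cc->alloc_flags & ALLOC_CMA */
	enum toy_mode mode;
	enum toy_mt migratetype;
	int order;
};

/* Mirrors is_migrate_movable(): CMA blocks count as movable too. */
static bool toy_is_movable(enum toy_mt mt)
{
	return mt == TOY_MOVABLE || mt == TOY_CMA;
}

static bool toy_suitable_source(const struct toy_cc *cc, enum toy_mt block_mt)
{
	assert(cc);

	/* Background compaction: all block types, including CMA. */
	if (!cc->direct_compaction)
		return true;

	/* Only ALLOC_CMA requests may vacate CMA blocks. */
	if (block_mt == TOY_CMA && !cc->alloc_cma)
		return false;

	/* Sync modes are not restricted further. */
	if (cc->mode != TOY_ASYNC)
		return true;

	/* Movable and whole-block requests scan movable blocks. */
	if (cc->migratetype == TOY_MOVABLE || cc->order >= TOY_PAGEBLOCK_ORDER)
		return toy_is_movable(block_mt);

	/* Small non-movable requests stay within their own type. */
	return block_mt == cc->migratetype;
}
```

With these assumptions, an async order-0 unmovable request is refused a movable block, while the same request at TOY_PAGEBLOCK_ORDER is allowed to evacuate it; a CMA block stays off limits unless alloc_cma is set.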
* [PATCH 3/4] mm: page_alloc: move capture_control to the page allocator
2026-06-26 18:21 [PATCH 0/4] mm: fix reclaim storms in defrag_mode Johannes Weiner
2026-06-26 18:21 ` [PATCH 1/4] mm: page_alloc: __GFP_FS lockdep annotation for direct compaction Johannes Weiner
2026-06-26 18:21 ` [PATCH 2/4] mm: compaction: support non-movable compaction for pageblock requests Johannes Weiner
@ 2026-06-26 18:21 ` Johannes Weiner
2026-07-01 18:02 ` Vlastimil Babka (SUSE)
2026-06-26 18:21 ` [PATCH 4/4] mm: page_alloc: fix non-movable reclaim storm in defrag_mode Johannes Weiner
3 siblings, 1 reply; 16+ messages in thread
From: Johannes Weiner @ 2026-06-26 18:21 UTC (permalink / raw)
To: Andrew Morton, Vlastimil Babka
Cc: Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Zi Yan,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, linux-mm, linux-kernel
The compaction capturing code assumes the allocation request order
and compaction target order are the same. That won't be true once
defrag_mode promotes sub-block allocations to pageblock-order
compaction: compaction targets the larger order, capture should
remain at the original allocation order.
Move the per-task capture_control to the page allocator, so its
fields can carry alloc-side information that compaction's
compact_control does not. Pass the capture_control through
try_to_compact_pages() / compact_zone_order() instead of a bare
struct page **; compact_zone_order() sets capc->cc while running.
task_capc() now also checks capc->cc to handle the new
not-yet-running state.
No functional change.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
include/linux/compaction.h |  3 ++-
mm/compaction.c            | 33 ++++++++++-----------------------
mm/page_alloc.c            | 23 +++++++++++++++++++++--
3 files changed, 33 insertions(+), 26 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index f29ef0653546..66a2f70e9e01 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -58,6 +58,7 @@ enum compact_result {
};
struct alloc_context; /* in mm/internal.h */
+struct capture_control; /* in mm/internal.h */
/*
* Number of free order-0 pages that should be available above given watermark
@@ -92,7 +93,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
unsigned int order, unsigned int alloc_flags,
const struct alloc_context *ac, enum compact_priority prio,
- struct page **page);
+ struct capture_control *capc);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern bool compaction_suitable(struct zone *zone, int order,
unsigned long watermark, int highest_zoneidx);
diff --git a/mm/compaction.c b/mm/compaction.c
index 7df3a85d43af..c2701bf1d04e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2791,7 +2791,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
static enum compact_result compact_zone_order(struct zone *zone, int order,
gfp_t gfp_mask, enum compact_priority prio,
unsigned int alloc_flags, int highest_zoneidx,
- struct page **capture)
+ struct capture_control *capc)
{
enum compact_result ret;
struct compact_control cc = {
@@ -2808,35 +2808,22 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
};
- struct capture_control capc = {
- .cc = &cc,
- .page = NULL,
- };
- /*
- * Make sure the structs are really initialized before we expose the
- * capture control, in case we are interrupted and the interrupt handler
- * frees a page.
- */
+ /* See the comment in __alloc_pages_direct_compact() */
barrier();
- WRITE_ONCE(current->capture_control, &capc);
+ WRITE_ONCE(capc->cc, &cc);
- ret = compact_zone(&cc, &capc);
+ ret = compact_zone(&cc, capc);
+
+ WRITE_ONCE(capc->cc, NULL);
- /*
- * Make sure we hide capture control first before we read the captured
- * page pointer, otherwise an interrupt could free and capture a page
- * and we would leak it.
- */
- WRITE_ONCE(current->capture_control, NULL);
- *capture = READ_ONCE(capc.page);
/*
* Technically, it is also possible that compaction is skipped but
* the page is still captured out of luck(IRQ came and freed the page).
* Returning COMPACT_SUCCESS in such cases helps in properly accounting
* the COMPACT[STALL|FAIL] when compaction is skipped.
*/
- if (*capture)
+ if (capc->page)
ret = COMPACT_SUCCESS;
return ret;
@@ -2849,13 +2836,13 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
* @alloc_flags: The allocation flags of the current allocation
* @ac: The context of current allocation
* @prio: Determines how hard direct compaction should try to succeed
- * @capture: Pointer to free page created by compaction will be stored here
+ * @capc: The context for capturing pages during freeing
*
* This is the main entry point for direct page compaction.
*/
enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
unsigned int alloc_flags, const struct alloc_context *ac,
- enum compact_priority prio, struct page **capture)
+ enum compact_priority prio, struct capture_control *capc)
{
struct zoneref *z;
struct zone *zone;
@@ -2883,7 +2870,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
}
status = compact_zone_order(zone, order, gfp_mask, prio,
- alloc_flags, ac->highest_zoneidx, capture);
+ alloc_flags, ac->highest_zoneidx, capc);
rc = max(status, rc);
/* The allocation should succeed, stop compacting */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cb422505c6ef..9dee1c47e795 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -718,7 +718,7 @@ static inline struct capture_control *task_capc(struct zone *zone)
{
struct capture_control *capc = current->capture_control;
- return unlikely(capc) &&
+ return unlikely(capc && capc->cc) &&
!(current->flags & PF_KTHREAD) &&
!capc->page &&
capc->cc->zone == zone ? capc : NULL;
@@ -4146,23 +4146,42 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct page *page = NULL;
unsigned long pflags;
unsigned int noreclaim_flag;
+ struct capture_control capc = {
+ .page = NULL,
+ };
if (!order)
return NULL;
+ /*
+ * Make sure the structs are really initialized before we expose the
+ * capture control, in case we are interrupted and the interrupt handler
+ * frees a page.
+ */
+ barrier();
+ WRITE_ONCE(current->capture_control, &capc);
+
psi_memstall_enter(&pflags);
delayacct_compact_start();
fs_reclaim_acquire(gfp_mask);
noreclaim_flag = memalloc_noreclaim_save();
*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
- prio, &page);
+ prio, &capc);
memalloc_noreclaim_restore(noreclaim_flag);
fs_reclaim_release(gfp_mask);
psi_memstall_leave(&pflags);
delayacct_compact_end();
+ /*
+ * Make sure we hide capture control first before we read the captured
+ * page pointer, otherwise an interrupt could free and capture a page
+ * and we would leak it.
+ */
+ WRITE_ONCE(current->capture_control, NULL);
+ page = READ_ONCE(capc.page);
+
if (*compact_result == COMPACT_SKIPPED ||
*compact_result == COMPACT_DEFERRED)
return NULL;
--
2.54.0
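The new "published but not yet running" state can be modelled in userspace roughly as follows. This is a sketch under stated assumptions: the toy_* names are hypothetical, a plain global stands in for current->capture_control, and the PF_KTHREAD check and the READ_ONCE()/WRITE_ONCE() ordering are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-ins for the kernel structures. */
struct toy_cc   { int zone_id; };
struct toy_capc { struct toy_cc *cc; void *page; };

/* Models current->capture_control, set by the allocator before compaction. */
static struct toy_capc *toy_current_capc;

/* Models task_capc(): capture is armed only while capc->cc is set. */
static struct toy_capc *toy_task_capc(int zone_id)
{
	struct toy_capc *capc = toy_current_capc;

	if (capc && capc->cc && !capc->page && capc->cc->zone_id == zone_id)
		return capc;
	return NULL;
}

/* Free path: returns true if the page was captured instead of freed. */
static bool toy_free_page(int zone_id, void *page)
{
	struct toy_capc *capc = toy_task_capc(zone_id);

	if (!capc)
		return false;
	capc->page = page;
	return true;
}
```

In this model a page freed after the allocator has published the capture control but before compact_zone_order() has set capc->cc is not captured; capture only becomes possible between the two writes to capc->cc, and at most one page is taken.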
* [PATCH 4/4] mm: page_alloc: fix non-movable reclaim storm in defrag_mode
2026-06-26 18:21 [PATCH 0/4] mm: fix reclaim storms in defrag_mode Johannes Weiner
` (2 preceding siblings ...)
2026-06-26 18:21 ` [PATCH 3/4] mm: page_alloc: move capture_control to the page allocator Johannes Weiner
@ 2026-06-26 18:21 ` Johannes Weiner
2026-06-26 18:29 ` Zi Yan
2026-07-01 18:06 ` Vlastimil Babka (SUSE)
3 siblings, 2 replies; 16+ messages in thread
From: Johannes Weiner @ 2026-06-26 18:21 UTC (permalink / raw)
To: Andrew Morton, Vlastimil Babka
Cc: Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Zi Yan,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, linux-mm, linux-kernel
As we deployed defrag_mode into Meta production, pressure spikes and
excessive swapping were observed on some workloads. Tracing confirmed
that this is unmovable/reclaimable requests spinning in the allocator
and direct reclaim, causing excessive amounts of swap.
The initial plan for defrag_mode was to rely on kswapd/kcompactd to
produce blocks, and if those are overwhelmed under high pressure, let
the allocator fall back (__rmqueue_steal()) after its retry loops.
However, that retrying results in more reclaim on some of these
workloads than we'd hoped, sometimes excessively so, spurred on by the
!costly order conditions in should_reclaim_retry().
The storms are dependent on the request type. Reclaim will inevitably
make room in existing movable blocks, since that's where the LRU pages
live. So if movable requests retry on reclaim, they make progress.
When non-movable requests spin in reclaim, that isn't productive. They
cannot use the individually freed pages, and the process is unlikely
to accidentally free whole blocks to meet the ALLOC_NOFRAGMENT bar.
They spin and overreclaim excessively, which tanks performance and
triggers userspace guards like swap exhaustion or pressure-based OOM.
To fix this, send non-movable requests, regardless of order, into
pageblock reclaim/compaction. This way, they help move things along to
meet the ALLOC_NOFRAGMENT bar. After this patch, the reclaim storms
and excess OOM rates are no longer observed in production.
The longer-term plan is still to have all requests, including the
movable ones, help make blocks to spread the cost of defragmenting
more evenly and fairly; combined with proper watermarking to reduce
allocation latencies in the common case. However, doing this naively
unearths scaling and concurrency limitations in compaction that need
to be addressed first. Promoting just non-movables for now is the
minimally viable bug fix for the above issue.
Fixes: f38356df6474 ("mm: page_alloc: introduce defrag_mode")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/internal.h   |  7 +++++++
mm/page_alloc.c | 36 +++++++++++++++++++++++++++++-------
2 files changed, 36 insertions(+), 7 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 181e79f1d6a2..1f636cfc859a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1060,6 +1060,13 @@ struct compact_control {
*/
struct capture_control {
struct compact_control *cc;
+ /*
+ * Allocation request order. May differ from the compaction
+ * order: defrag_mode promotes sub-block allocations to
+ * pageblock-order compaction; capture still matches at the
+ * original allocation order so prep_new_page() is consistent.
+ */
+ int order;
struct page *page;
};
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9dee1c47e795..575a99a4c723 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -728,7 +728,7 @@ static inline bool
compaction_capture(struct capture_control *capc, struct page *page,
int order, int migratetype)
{
- if (!capc || order != capc->cc->order)
+ if (!capc || order != capc->order)
return false;
/* Do not accidentally pollute CMA or isolated regions*/
@@ -748,7 +748,7 @@ compaction_capture(struct capture_control *capc, struct page *page,
return false;
if (migratetype != capc->cc->migratetype)
- trace_mm_page_alloc_extfrag(page, capc->cc->order, order,
+ trace_mm_page_alloc_extfrag(page, capc->order, order,
capc->cc->migratetype, migratetype);
capc->page = page;
@@ -4147,10 +4147,27 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
unsigned long pflags;
unsigned int noreclaim_flag;
struct capture_control capc = {
+ .order = order,
.page = NULL,
};
+ int compact_order = order;
- if (!order)
+ /*
+ * If fallbacks are not permitted (defrag_mode), we either
+ * need to reclaim space in a block of matching type, or clear
+ * out an entire block to allow __rmqueue_claim() to convert.
+ *
+ * Reclaim by itself is primarily freeing space in movable
+ * blocks, since that's where the LRU pages live. So this
+ * works for movable requests, but not for others.
+ *
+ * For those, promote the order to help make blocks, instead
+ * of spinning in reclaim alone unproductively.
+ */
+ if ((alloc_flags & ALLOC_NOFRAGMENT) && ac->migratetype != MIGRATE_MOVABLE)
+ compact_order = max(order, pageblock_order);
+
+ if (!compact_order)
return NULL;
/*
@@ -4166,8 +4183,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
fs_reclaim_acquire(gfp_mask);
noreclaim_flag = memalloc_noreclaim_save();
- *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
- prio, &capc);
+ *compact_result = try_to_compact_pages(gfp_mask, compact_order,
+ alloc_flags, ac, prio, &capc);
memalloc_noreclaim_restore(noreclaim_flag);
fs_reclaim_release(gfp_mask);
@@ -4203,7 +4220,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct zone *zone = page_zone(page);
zone->compact_blockskip_flush = false;
- compaction_defer_reset(zone, order, true);
+ compaction_defer_reset(zone, compact_order, true);
count_vm_event(COMPACTSUCCESS);
return page;
}
@@ -4443,9 +4460,14 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
struct page *page = NULL;
unsigned long pflags;
bool drained = false;
+ int reclaim_order = order;
+
+ /* Match the slowpath compaction promotion in __alloc_pages_direct_compact */
+ if ((alloc_flags & ALLOC_NOFRAGMENT) && ac->migratetype != MIGRATE_MOVABLE)
+ reclaim_order = max(order, pageblock_order);
psi_memstall_enter(&pflags);
- *did_some_progress = __perform_reclaim(gfp_mask, order, ac);
+ *did_some_progress = __perform_reclaim(gfp_mask, reclaim_order, ac);
if (unlikely(!(*did_some_progress)))
goto out;
--
2.54.0
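The order promotion and the capture-order split in this patch can be sketched in userspace as below. The toy_* names, the enum values and TOY_PAGEBLOCK_ORDER = 9 are assumptions for illustration; only the decision logic is taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; values are illustrative only. */
enum toy_mt { TOY_UNMOVABLE, TOY_MOVABLE, TOY_RECLAIMABLE };

#define TOY_PAGEBLOCK_ORDER 9	/* assumed; configuration dependent in the kernel */

/* Models struct capture_control after this patch: it carries the alloc order. */
struct toy_capc {
	int order;		/* original allocation order */
	int compact_order;	/* order compaction actually targets */
};

/*
 * Models the promotion in __alloc_pages_direct_compact() and
 * __alloc_pages_direct_reclaim(): non-movable requests under
 * ALLOC_NOFRAGMENT work at pageblock order or above.
 */
static int toy_promoted_order(bool nofragment, enum toy_mt mt, int order)
{
	assert(order >= 0);

	if (nofragment && mt != TOY_MOVABLE)
		return order > TOY_PAGEBLOCK_ORDER ? order : TOY_PAGEBLOCK_ORDER;
	return order;
}

/* Models the order check in compaction_capture(): match the alloc order. */
static bool toy_capture_matches(const struct toy_capc *capc, int freed_order)
{
	return capc != NULL && freed_order == capc->order;
}
```

Under these assumptions an order-0 unmovable request in defrag_mode compacts and reclaims at order 9, yet a page freed at order 0 is still the one that matches for capture, which is what the new capc->order field is for.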
* Re: [PATCH 4/4] mm: page_alloc: fix non-movable reclaim storm in defrag_mode
2026-06-26 18:21 ` [PATCH 4/4] mm: page_alloc: fix non-movable reclaim storm in defrag_mode Johannes Weiner
@ 2026-06-26 18:29 ` Zi Yan
2026-06-26 18:43 ` Johannes Weiner
2026-07-01 18:06 ` Vlastimil Babka (SUSE)
1 sibling, 1 reply; 16+ messages in thread
From: Zi Yan @ 2026-06-26 18:29 UTC (permalink / raw)
To: Johannes Weiner, Andrew Morton, Vlastimil Babka
Cc: Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, linux-mm, linux-kernel
On Fri Jun 26, 2026 at 2:21 PM EDT, Johannes Weiner wrote:
> As we deployed defrag_mode into Meta production, pressure spikes and
> excessive swapping were observed on some workloads. Tracing confirmed
> that this is unmovable/reclaimable requests spinning in the allocator
> and direct reclaim, causing excessive amounts of swap.
>
> The initial plan for defrag_mode was to rely on kswapd/kcompactd to
> produce blocks, and if those are overwhelmed under high pressure, let
> the allocator fall back (__rmqueue_steal()) after its retry loops.
> However, that retrying results in more reclaim on some of these
> workloads than we'd hoped, sometimes excessively so, spurred on by the
> !costly order conditions in should_reclaim_retry().
>
> The storms are dependent on the request type. Reclaim will inevitably
> make room in existing movable blocks, since that's where the LRU pages
> live. So if movable requests retry on reclaim, they make progress.
>
> When non-movable requests spin in reclaim, that isn't productive. They
> cannot use the individually freed pages, and the process is unlikely
> to accidentally free whole blocks to meet the ALLOC_NOFRAGMENT bar.
> They spin and overreclaim excessively, which tanks performance and
> triggers userspace guards like swap exhaustion or pressure-based OOM.
>
> To fix this, send non-movable requests, regardless of order, into
> pageblock reclaim/compaction. This way, they help move things along to
> meet the ALLOC_NOFRAGMENT bar. After this patch, the reclaim storms
> and excess OOM rates are no longer observed in production.
>
> The longer-term plan is still to have all requests, including the
> movable ones, help make blocks to spread the cost of defragmenting
> more evenly and fairly; combined with proper watermarking to reduce
> allocation latencies in the common case. However, doing this naively
> unearths scaling and concurrency limitations in compaction that need
> to be addressed first. Promoting just non-movables for now is the
> minimally viable bug fix for the above issue.
>
> Fixes: f38356df6474 ("mm: page_alloc: introduce defrag_mode")
Should be
Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode").
Since I cannot find f38356df6474 in the tree.
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
--
Best Regards,
Yan, Zi
* Re: [PATCH 4/4] mm: page_alloc: fix non-movable reclaim storm in defrag_mode
2026-06-26 18:29 ` Zi Yan
@ 2026-06-26 18:43 ` Johannes Weiner
0 siblings, 0 replies; 16+ messages in thread
From: Johannes Weiner @ 2026-06-26 18:43 UTC (permalink / raw)
To: Zi Yan
Cc: Andrew Morton, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Mike Rapoport, linux-mm, linux-kernel
On Fri, Jun 26, 2026 at 02:29:24PM -0400, Zi Yan wrote:
> On Fri Jun 26, 2026 at 2:21 PM EDT, Johannes Weiner wrote:
> > As we deployed defrag_mode into Meta production, pressure spikes and
> > excessive swapping were observed on some workloads. Tracing confirmed
> > that this is unmovable/reclaimable requests spinning in the allocator
> > and direct reclaim, causing excessive amounts of swap.
> >
> > The initial plan for defrag_mode was to rely on kswapd/kcompactd to
> > produce blocks, and if those are overwhelmed under high pressure, let
> > the allocator fall back (__rmqueue_steal()) after its retry loops.
> > However, that retrying results in more reclaim on some of these
> > workloads than we'd hoped, sometimes excessively so, spurred on by the
> > !costly order conditions in should_reclaim_retry().
> >
> > The storms are dependent on the request type. Reclaim will inevitably
> > make room in existing movable blocks, since that's where the LRU pages
> > live. So if movable requests retry on reclaim, they make progress.
> >
> > When non-movable requests spin in reclaim, that isn't productive. They
> > cannot use the individually freed pages, and the process is unlikely
> > to accidentally free whole blocks to meet the ALLOC_NOFRAGMENT bar.
> > They spin and overreclaim excessively, which tanks performance and
> > triggers userspace guards like swap exhaustion or pressure-based OOM.
> >
> > To fix this, send non-movable requests, regardless of order, into
> > pageblock reclaim/compaction. This way, they help move things along to
> > meet the ALLOC_NOFRAGMENT bar. After this patch, the reclaim storms
> > and excess OOM rates are no longer observed in production.
> >
> > The longer-term plan is still to have all requests, including the
> > movable ones, help make blocks to spread the cost of defragmenting
> > more evenly and fairly; combined with proper watermarking to reduce
> > allocation latencies in the common case. However, doing this naively
> > unearths scaling and concurrency limitations in compaction that need
> > to be addressed first. Promoting just non-movables for now is the
> > minimally viable bug fix for the above issue.
> >
> > Fixes: f38356df6474 ("mm: page_alloc: introduce defrag_mode")
>
> Should be
> Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode").
> Since I cannot find f38356df6474 in the tree.
Oops, indeed. I managed to pull that commit from the old development
branch I still had locally.
* Re: [PATCH 1/4] mm: page_alloc: __GFP_FS lockdep annotation for direct compaction
2026-06-26 18:21 ` [PATCH 1/4] mm: page_alloc: __GFP_FS lockdep annotation for direct compaction Johannes Weiner
@ 2026-07-01 13:45 ` Vlastimil Babka (SUSE)
0 siblings, 0 replies; 16+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-07-01 13:45 UTC (permalink / raw)
To: Johannes Weiner, Andrew Morton
Cc: Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Zi Yan,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, linux-mm, linux-kernel
On 6/26/26 20:21, Johannes Weiner wrote:
> A subsequent patch will have some order-0 allocations participate in
> compaction under defrag_mode, to stave off extfrag events.
>
> Since this is a sprawling expansion of entry points, and compaction
> can enter filesystem paths, add a lockdep annotation that catches
> __GFP_FS passing errors.
>
> Direct reclaim has had this annotation for a while, and since reclaim
> and compaction are usually used in conjunction, this is unlikely to
> unearth old bugs. It's more about future proofing and peace of mind.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
> mm/page_alloc.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ee902a468c2f..cb422505c6ef 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4152,12 +4152,14 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>
> psi_memstall_enter(&pflags);
> delayacct_compact_start();
> + fs_reclaim_acquire(gfp_mask);
> noreclaim_flag = memalloc_noreclaim_save();
>
> *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> prio, &page);
>
> memalloc_noreclaim_restore(noreclaim_flag);
> + fs_reclaim_release(gfp_mask);
> psi_memstall_leave(&pflags);
> delayacct_compact_end();
>
* Re: [PATCH 2/4] mm: compaction: support non-movable compaction for pageblock requests
2026-06-26 18:21 ` [PATCH 2/4] mm: compaction: support non-movable compaction for pageblock requests Johannes Weiner
@ 2026-07-01 14:19 ` Vlastimil Babka (SUSE)
2026-07-01 15:28 ` Johannes Weiner
0 siblings, 1 reply; 16+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-07-01 14:19 UTC (permalink / raw)
To: Johannes Weiner, Andrew Morton
Cc: Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Zi Yan,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, linux-mm, linux-kernel
On 6/26/26 20:21, Johannes Weiner wrote:
> While trying to fix a reclaim storm in defrag_mode, I noticed that
> non-movable direct compaction is extremely inefficient.
>
> When searching for space to evacuate, compaction only allows blocks of
> the same type as the incoming request. This is to prevent migratetype
> pollution, where a small non-movable request frees space in a movable
> block and provokes the allocator to fall back and pollute it.
>
> This protection is reasonable on one hand, but the downside is that it
> makes non-movable direct compaction nearly useless: if we get the type
> annotations right, by definition there aren't any movable pages inside
> the non-movable blocks it is allowed to scan.
>
> With defrag_mode, the goal is the production of whole blocks, which
> are essentially type neutral: __rmqueue_claim() will convert them
> wholesale on alloc. This makes type mixing and pollution a non-issue.
>
> Fix the pollution gates to take the requested order into account, and
> allow whole-block requests to scan blocks of other types.
>
> The only exception is CMA blocks. That type is sticky and these blocks
> cannot be claimed to other types. Continue to be strict with them, and
> allow only explicit ALLOC_CMA requests and kcompactd to evacuate them.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
A suggestion:
> ---
> mm/compaction.c | 35 ++++++++++++++++++++++++++++-------
> 1 file changed, 28 insertions(+), 7 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index f08765ade014..7df3a85d43af 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1381,12 +1381,33 @@ static bool suitable_migration_source(struct compact_control *cc,
> if (pageblock_skip_persistent(page))
> return false;
>
> - if ((cc->mode != MIGRATE_ASYNC) || !cc->direct_compaction)
> + /*
> + * Background compaction produces blocks for the zone at
> + * large, with no particular allocation context. Allow all
> + * block types, including CMA.
> + */
> + if (!cc->direct_compaction)
> return true;
>
> block_mt = get_pageblock_migratetype(page);
>
> - if (cc->migratetype == MIGRATE_MOVABLE)
> + /*
> + * CMA pages can only be taken by ALLOC_CMA requests. For anybody
> + * else, vacating a CMA block consumes free pages the caller
> + * could have used, and produces free pages it cannot.
> + */
> + if (is_migrate_cma(block_mt) && !(cc->alloc_flags & ALLOC_CMA))
> + return false;
> +
> + if (cc->mode != MIGRATE_ASYNC)
> + return true;
This now stands out as uncommented. Can we come up with a rationale? :)
> +
> + /*
> + * Prevent small unmovable/reclaimable requests from polluting
> + * movable blocks through fallbacks. Whole-block production is
> + * exempt as the allocator claims and converts these.
> + */
> + if (cc->migratetype == MIGRATE_MOVABLE || cc->order >= pageblock_order)
> return is_migrate_movable(block_mt);
> else
> return block_mt == cc->migratetype;
> @@ -1974,12 +1995,12 @@ static unsigned long fast_find_migrateblock(struct compact_control *cc)
> return pfn;
>
> /*
> - * Only allow kcompactd and direct requests for movable pages to
> - * quickly clear out a MOVABLE pageblock for allocation. This
> - * reduces the risk that a large movable pageblock is freed for
> - * an unmovable/reclaimable small allocation.
> + * Prevent small unmovable/reclaimable requests from polluting
> + * movable blocks through fallbacks. Whole-block production is
> + * exempt as the allocator claims and converts these.
> */
> - if (cc->direct_compaction && cc->migratetype != MIGRATE_MOVABLE)
> + if (cc->direct_compaction && cc->migratetype != MIGRATE_MOVABLE &&
> + cc->order < pageblock_order)
> return pfn;
>
> /*
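For readers following along, the gate order in suitable_migration_source() after this patch can be modeled in plain userspace C. This is only a sketch: the enums, the struct, the ALLOC_CMA_FLAG value and PAGEBLOCK_ORDER = 9 are stand-ins I made up for illustration, and the pageblock_skip_persistent() check is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for kernel types; names and values are illustrative. */
enum migratetype { MT_UNMOVABLE, MT_MOVABLE, MT_RECLAIMABLE, MT_CMA };
enum migrate_mode { MODE_ASYNC, MODE_SYNC_LIGHT, MODE_SYNC };

#define ALLOC_CMA_FLAG	0x1u	/* hypothetical value, stands in for ALLOC_CMA */
#define PAGEBLOCK_ORDER	9	/* assumed; config dependent in the kernel */

struct cc_model {
	bool direct_compaction;
	enum migrate_mode mode;
	enum migratetype migratetype;
	unsigned int alloc_flags;
	int order;
};

/* Mirrors is_migrate_movable(): MOVABLE and CMA blocks both count. */
bool mt_is_movable(enum migratetype mt)
{
	return mt == MT_MOVABLE || mt == MT_CMA;
}

/* Same gate order as the patched function, minus the skip-hint check. */
bool source_suitable(const struct cc_model *cc, enum migratetype block_mt)
{
	/* Background compaction: all block types, including CMA. */
	if (!cc->direct_compaction)
		return true;

	/* CMA blocks only for requests that may allocate from CMA. */
	if (block_mt == MT_CMA && !(cc->alloc_flags & ALLOC_CMA_FLAG))
		return false;

	/* The uncommented gate: sync scans skip the type filter. */
	if (cc->mode != MODE_ASYNC)
		return true;

	/* Async: movable or whole-block requests scan movable blocks. */
	if (cc->migratetype == MT_MOVABLE || cc->order >= PAGEBLOCK_ORDER)
		return mt_is_movable(block_mt);

	/* Async, small non-movable request: same-type blocks only. */
	return block_mt == cc->migratetype;
}
```

One thing the model makes visible: an async whole-block request from a non-movable caller is steered to movable blocks only, so blocks of its own type are no longer scanned in that mode.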
* Re: [PATCH 2/4] mm: compaction: support non-movable compaction for pageblock requests
2026-07-01 14:19 ` Vlastimil Babka (SUSE)
@ 2026-07-01 15:28 ` Johannes Weiner
2026-07-01 18:14 ` Vlastimil Babka (SUSE)
0 siblings, 1 reply; 16+ messages in thread
From: Johannes Weiner @ 2026-07-01 15:28 UTC (permalink / raw)
To: Vlastimil Babka (SUSE)
Cc: Andrew Morton, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Zi Yan, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, linux-mm, linux-kernel
On Wed, Jul 01, 2026 at 04:19:29PM +0200, Vlastimil Babka (SUSE) wrote:
> On 6/26/26 20:21, Johannes Weiner wrote:
> > While trying to fix a reclaim storm in defrag_mode, I noticed that
> > non-movable direct compaction is extremely inefficient.
> >
> > When searching for space to evacuate, compaction only allows blocks of
> > the same type as the incoming request. This is to prevent migratetype
> > pollution, where a small non-movable request frees space in a movable
> > block and provokes the allocator to fall back and pollute it.
> >
> > This protection is reasonable on one hand, but the downside is that it
> > makes non-movable direct compaction nearly useless: if we get the type
> > annotations right, by definition there aren't any movable pages inside
> > the non-movable blocks it is allowed to scan.
> >
> > With defrag_mode, the goal is the production of whole blocks, which
> > are essentially type neutral: __rmqueue_claim() will convert them
> > wholesale on alloc. This makes type mixing and pollution a non-issue.
> >
> > Fix the pollution gates to take the requested order into account, and
> > allow whole-block requests to scan blocks of other types.
> >
> > The only exception is CMA blocks. That type is sticky and these blocks
> > cannot be claimed to other types. Continue to be strict with them, and
> > allow only explicit ALLOC_CMA requests and kcompactd to evacuate them.
> >
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>
> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Thanks!
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index f08765ade014..7df3a85d43af 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -1381,12 +1381,33 @@ static bool suitable_migration_source(struct compact_control *cc,
> > if (pageblock_skip_persistent(page))
> > return false;
> >
> > - if ((cc->mode != MIGRATE_ASYNC) || !cc->direct_compaction)
> > + /*
> > + * Background compaction produces blocks for the zone at
> > + * large, with no particular allocation context. Allow all
> > + * block types, including CMA.
> > + */
> > + if (!cc->direct_compaction)
> > return true;
> >
> > block_mt = get_pageblock_migratetype(page);
> >
> > - if (cc->migratetype == MIGRATE_MOVABLE)
> > + /*
> > + * CMA pages can only be taken by ALLOC_CMA requests. For anybody
> > + * else, vacating a CMA block consumes free pages the caller
> > + * could have used, and produces free pages it cannot.
> > + */
> > + if (is_migrate_cma(block_mt) && !(cc->alloc_flags & ALLOC_CMA))
> > + return false;
> > +
> > + if (cc->mode != MIGRATE_ASYNC)
> > + return true;
>
> This now stands out as uncommented. Can we come up with a rationale? :)
Let's see. Originally it came from here:
commit 9927af740b1b9b1e769310bd0b91425e8047b803
Author: Mel Gorman <mel@csn.ul.ie>
Date: Thu Jan 13 15:45:59 2011 -0800
mm: compaction: perform a faster migration scan when migrating asynchronously
This limited async scanners to movable blocks. By keeping them to the
most productive space, it keeps their latencies down.
But then there was a follow up here:
commit 282722b0d258ec23fc79d80165418fee83f01736
Author: Vlastimil Babka <vbabka@kernel.org>
Date: Mon May 8 15:54:49 2017 -0700
mm, compaction: restrict async compaction to pageblocks of same migratetype
This made the migratetype filtering about preventing block
pollution. The patch quotes reduced extfrag numbers.
So now we have a block pollution guard that we apply only if... the
scanner is latency sensitive? :) Is this actually desired behavior?
Another way of looking at it would be this:
/*
* Allocation fallbacks can spread migratable pages
* into non-movable blocks. This is high-effort,
* low-result work. Restrict it to sync scans.
*/
* Re: [PATCH 3/4] mm: page_alloc: move capture_control to the page allocator
2026-06-26 18:21 ` [PATCH 3/4] mm: page_alloc: move capture_control to the page allocator Johannes Weiner
@ 2026-07-01 18:02 ` Vlastimil Babka (SUSE)
2026-07-01 20:57 ` Johannes Weiner
0 siblings, 1 reply; 16+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-07-01 18:02 UTC (permalink / raw)
To: Johannes Weiner, Andrew Morton
Cc: Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Zi Yan,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, linux-mm, linux-kernel
On 6/26/26 20:21, Johannes Weiner wrote:
> The compaction capturing code assumes the allocation request order
> and compaction target order are the same. That won't be true once
> defrag_mode promotes sub-block allocations to pageblock-order
> compaction: compaction targets the larger order, capture should
> remain at the original allocation order.
Well I guess you could also try to capture the whole-pageblock page and then
deal with it as with whole pageblock stealing? But this works too and
perhaps it's simpler.
> Move the per-task capture_control to the page allocator, so its
> fields can carry alloc-side information that compaction's
> compact_control does not. Pass the capture_control through
> try_to_compact_pages() / compact_zone_order() instead of a bare
> struct page **; compact_zone_order() sets capc->cc while running.
>
> task_capc() now also checks capc->cc to handle the new
> not-yet-running state.
>
> No functional change.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> include/linux/compaction.h | 3 ++-
> mm/compaction.c | 33 ++++++++++-----------------------
> mm/page_alloc.c | 23 +++++++++++++++++++++--
> 3 files changed, 33 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index f29ef0653546..66a2f70e9e01 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -58,6 +58,7 @@ enum compact_result {
> };
>
> struct alloc_context; /* in mm/internal.h */
> +struct capture_control; /* in mm/internal.h */
>
> /*
> * Number of free order-0 pages that should be available above given watermark
> @@ -92,7 +93,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
> extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
> unsigned int order, unsigned int alloc_flags,
> const struct alloc_context *ac, enum compact_priority prio,
> - struct page **page);
> + struct capture_control *capc);
> extern void reset_isolation_suitable(pg_data_t *pgdat);
> extern bool compaction_suitable(struct zone *zone, int order,
> unsigned long watermark, int highest_zoneidx);
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 7df3a85d43af..c2701bf1d04e 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2791,7 +2791,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
> static enum compact_result compact_zone_order(struct zone *zone, int order,
> gfp_t gfp_mask, enum compact_priority prio,
> unsigned int alloc_flags, int highest_zoneidx,
> - struct page **capture)
> + struct capture_control *capc)
> {
> enum compact_result ret;
> struct compact_control cc = {
> @@ -2808,35 +2808,22 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
> .ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
> .ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
> };
> - struct capture_control capc = {
> - .cc = &cc,
> - .page = NULL,
> - };
>
> - /*
> - * Make sure the structs are really initialized before we expose the
> - * capture control, in case we are interrupted and the interrupt handler
> - * frees a page.
> - */
> + /* See the comment in __alloc_pages_direct_compact() */
> barrier();
> - WRITE_ONCE(current->capture_control, &capc);
> + WRITE_ONCE(capc->cc, &cc);
>
> - ret = compact_zone(&cc, &capc);
> + ret = compact_zone(&cc, capc);
> +
> + WRITE_ONCE(capc->cc, NULL);
I wonder if it makes sense to continue having capc->cc and this whole dance
in two functions. AFAICS (after patch 4/4) we access only capc->cc->zone and
capc->cc->migratetype. migratetype is stable in the whole
try_to_compact_pages(), could be part of capc. Order can be added by this
patch (with no semantic change to it) and not the next one. Zone varies, but
could be also in capc and set by try_to_compact_pages() before every call to
compact_zone_order(). Then compact_zone_order() doesn't have to set up any
capc fields anymore?
>
> - /*
> - * Make sure we hide capture control first before we read the captured
> - * page pointer, otherwise an interrupt could free and capture a page
> - * and we would leak it.
> - */
> - WRITE_ONCE(current->capture_control, NULL);
> - *capture = READ_ONCE(capc.page);
> /*
> * Technically, it is also possible that compaction is skipped but
> * the page is still captured out of luck(IRQ came and freed the page).
> * Returning COMPACT_SUCCESS in such cases helps in properly accounting
> * the COMPACT[STALL|FAIL] when compaction is skipped.
> */
> - if (*capture)
> + if (capc->page)
> ret = COMPACT_SUCCESS;
>
> return ret;
> @@ -2849,13 +2836,13 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
> * @alloc_flags: The allocation flags of the current allocation
> * @ac: The context of current allocation
> * @prio: Determines how hard direct compaction should try to succeed
> - * @capture: Pointer to free page created by compaction will be stored here
> + * @capc: The context for capturing pages during freeing
> *
> * This is the main entry point for direct page compaction.
> */
> enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
> unsigned int alloc_flags, const struct alloc_context *ac,
> - enum compact_priority prio, struct page **capture)
> + enum compact_priority prio, struct capture_control *capc)
> {
> struct zoneref *z;
> struct zone *zone;
> @@ -2883,7 +2870,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
> }
>
> status = compact_zone_order(zone, order, gfp_mask, prio,
> - alloc_flags, ac->highest_zoneidx, capture);
> + alloc_flags, ac->highest_zoneidx, capc);
> rc = max(status, rc);
>
> /* The allocation should succeed, stop compacting */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index cb422505c6ef..9dee1c47e795 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -718,7 +718,7 @@ static inline struct capture_control *task_capc(struct zone *zone)
> {
> struct capture_control *capc = current->capture_control;
>
> - return unlikely(capc) &&
> + return unlikely(capc && capc->cc) &&
> !(current->flags & PF_KTHREAD) &&
> !capc->page &&
> capc->cc->zone == zone ? capc : NULL;
> @@ -4146,23 +4146,42 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> struct page *page = NULL;
> unsigned long pflags;
> unsigned int noreclaim_flag;
> + struct capture_control capc = {
> + .page = NULL,
You didn't set .cc to NULL explicitly...
> + };
>
> if (!order)
> return NULL;
>
> + /*
> + * Make sure the structs are really initialized before we expose the
> + * capture control, in case we are interrupted and the interrupt handler
> + * frees a page.
> + */
> + barrier();
So either an implicit { } NULL / zero initialization + barrier() is enough
(I hope so), in which case we don't need to set every field to NULL / zero
explicitly; or it is not, and then we should set every field and not just page.
> + WRITE_ONCE(current->capture_control, &capc);
> +
> psi_memstall_enter(&pflags);
> delayacct_compact_start();
> fs_reclaim_acquire(gfp_mask);
> noreclaim_flag = memalloc_noreclaim_save();
>
> *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> - prio, &page);
> + prio, &capc);
>
> memalloc_noreclaim_restore(noreclaim_flag);
> fs_reclaim_release(gfp_mask);
> psi_memstall_leave(&pflags);
> delayacct_compact_end();
>
> + /*
> + * Make sure we hide capture control first before we read the captured
> + * page pointer, otherwise an interrupt could free and capture a page
> + * and we would leak it.
> + */
> + WRITE_ONCE(current->capture_control, NULL);
> + page = READ_ONCE(capc.page);
> +
> if (*compact_result == COMPACT_SKIPPED ||
> *compact_result == COMPACT_DEFERRED)
> return NULL;
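On the initialization question above: as far as the C language goes, members left out of a designated initializer list are zero-initialized, so .cc starts out NULL even though only .page is named. A minimal userspace sketch, with made-up *_model types standing in for the kernel structs; it says nothing about whether barrier() is sufficient against the interrupt case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for compact_control and capture_control. */
struct compact_control_model {
	int order;
};

struct capture_control_model {
	struct compact_control_model *cc;
	int order;
	void *page;
};

/*
 * Only .page is named in the initializer. C initializes the remaining
 * members as if they had static storage duration: pointers become
 * null pointers and integers become zero.
 */
struct capture_control_model make_capc(void)
{
	struct capture_control_model capc = {
		.page = NULL,
	};

	return capc;
}
```

So the explicit .page = NULL is redundant from the language's point of view; whether to spell out every field is a readability choice.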
* Re: [PATCH 4/4] mm: page_alloc: fix non-movable reclaim storm in defrag_mode
2026-06-26 18:21 ` [PATCH 4/4] mm: page_alloc: fix non-movable reclaim storm in defrag_mode Johannes Weiner
2026-06-26 18:29 ` Zi Yan
@ 2026-07-01 18:06 ` Vlastimil Babka (SUSE)
2026-07-01 21:02 ` Johannes Weiner
1 sibling, 1 reply; 16+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-07-01 18:06 UTC (permalink / raw)
To: Johannes Weiner, Andrew Morton
Cc: Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Zi Yan,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, linux-mm, linux-kernel
On 6/26/26 20:21, Johannes Weiner wrote:
> As we deployed defrag_mode into Meta production, pressure spikes and
> excessive swapping were observed on some workloads. Tracing confirmed
> that this is unmovable/reclaimable requests spinning in the allocator
> and direct reclaim, causing excessive amounts of swap.
>
> The initial plan for defrag_mode was to rely on kswapd/kcompactd to
> produce blocks, and if those are overwhelmed under high pressure, let
> the allocator fall back (__rmqueue_steal()) after its retry loops.
> However, that retrying results in more reclaim on some of these
> workloads than we'd hoped, sometimes excessively so, spurred on by the
> !costly order conditions in should_reclaim_retry().
>
> The storms are dependent on the request type. Reclaim will inevitably
> make room in existing movable blocks, since that's where the LRU pages
> live. So if movable requests retry on reclaim, they make progress.
>
> When non-movable requests spin in reclaim that isn't productive. They
> cannot use the individually freed pages, and the process is unlikely
> to accidentally free whole blocks to meet the ALLOC_NOFRAGMENT bar.
> They spin and overreclaim excessively, which tanks performance and
> triggers userspace guards like swap exhaustion or pressure based OOM.
>
> To fix this, send non-movable requests, regardless of order, into
> pageblock reclaim/compaction. This way, they help move things along to
> meet the ALLOC_NOFRAGMENT bar. After this patch, the reclaim storms
> and excess OOM rates are no longer observed in production.
>
> The longer-term plan is still to have all requests, including the
> movable ones, help make blocks to spread the cost of defragmenting
> more evenly and fairly; combined with proper watermarking to reduce
> allocation latencies in the common case. However, doing this naively
> unearths scaling and concurrency limitations in compaction that need
> to be addressed first. Promoting just non-movables for now is the
> minimally viable bug fix for the above issue.
>
> Fixes: f38356df6474 ("mm: page_alloc: introduce defrag_mode")
That's from 6.15. Do you intend any stable backporting, or do we just mark it
as a heads-up for anyone who tracks fixes and might consider it?
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
LGTM, but as my suggestion for 3/4 would change it a lot, I will wait with
formal tags.
> ---
> mm/internal.h | 7 +++++++
> mm/page_alloc.c | 36 +++++++++++++++++++++++++++++-------
> 2 files changed, 36 insertions(+), 7 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 181e79f1d6a2..1f636cfc859a 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1060,6 +1060,13 @@ struct compact_control {
> */
> struct capture_control {
> struct compact_control *cc;
> + /*
> + * Allocation request order. May differ from the compaction
> + * order: defrag_mode promotes sub-block allocations to
> + * pageblock-order compaction; capture still matches at the
> + * original allocation order so prep_new_page() is consistent.
> + */
> + int order;
> struct page *page;
> };
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9dee1c47e795..575a99a4c723 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -728,7 +728,7 @@ static inline bool
> compaction_capture(struct capture_control *capc, struct page *page,
> int order, int migratetype)
> {
> - if (!capc || order != capc->cc->order)
> + if (!capc || order != capc->order)
> return false;
>
> /* Do not accidentally pollute CMA or isolated regions*/
> @@ -748,7 +748,7 @@ compaction_capture(struct capture_control *capc, struct page *page,
> return false;
>
> if (migratetype != capc->cc->migratetype)
> - trace_mm_page_alloc_extfrag(page, capc->cc->order, order,
> + trace_mm_page_alloc_extfrag(page, capc->order, order,
> capc->cc->migratetype, migratetype);
>
> capc->page = page;
> @@ -4147,10 +4147,27 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> unsigned long pflags;
> unsigned int noreclaim_flag;
> struct capture_control capc = {
> + .order = order,
> .page = NULL,
> };
> + int compact_order = order;
>
> - if (!order)
> + /*
> + * If fallbacks are not permitted (defrag_mode), we either
> + * need to reclaim space in a block of matching type, or clear
> + * out an entire block to allow __rmqueue_claim() to convert.
> + *
> + * Reclaim by itself is primarily freeing space in movable
> + * blocks, since that's where the LRU pages live. So this
> + * works for movable requests, but not for others.
> + *
> + * For those, promote the order to help make blocks, instead
> + * of spinning in reclaim alone unproductively.
> + */
> + if ((alloc_flags & ALLOC_NOFRAGMENT) && ac->migratetype != MIGRATE_MOVABLE)
> + compact_order = max(order, pageblock_order);
> +
> + if (!compact_order)
> return NULL;
>
> /*
> @@ -4166,8 +4183,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> fs_reclaim_acquire(gfp_mask);
> noreclaim_flag = memalloc_noreclaim_save();
>
> - *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> - prio, &capc);
> + *compact_result = try_to_compact_pages(gfp_mask, compact_order,
> + alloc_flags, ac, prio, &capc);
>
> memalloc_noreclaim_restore(noreclaim_flag);
> fs_reclaim_release(gfp_mask);
> @@ -4203,7 +4220,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> struct zone *zone = page_zone(page);
>
> zone->compact_blockskip_flush = false;
> - compaction_defer_reset(zone, order, true);
> + compaction_defer_reset(zone, compact_order, true);
> count_vm_event(COMPACTSUCCESS);
> return page;
> }
> @@ -4443,9 +4460,14 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> struct page *page = NULL;
> unsigned long pflags;
> bool drained = false;
> + int reclaim_order = order;
> +
> + /* Match the slowpath compaction promotion in __alloc_pages_direct_compact */
> + if ((alloc_flags & ALLOC_NOFRAGMENT) && ac->migratetype != MIGRATE_MOVABLE)
> + reclaim_order = max(order, pageblock_order);
>
> psi_memstall_enter(&pflags);
> - *did_some_progress = __perform_reclaim(gfp_mask, order, ac);
> + *did_some_progress = __perform_reclaim(gfp_mask, reclaim_order, ac);
> if (unlikely(!(*did_some_progress)))
> goto out;
>
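To make the order split in this patch concrete, here is a small userspace model of the promotion rule and of the capture match. It is a sketch under assumptions: ALLOC_NOFRAGMENT_FLAG, PAGEBLOCK_ORDER = 9 and the function names are mine, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins; the flag value and block order are assumed. */
enum migratetype { MT_UNMOVABLE, MT_MOVABLE, MT_RECLAIMABLE };

#define ALLOC_NOFRAGMENT_FLAG	0x100u
#define PAGEBLOCK_ORDER		9

/*
 * Order handed to reclaim/compaction. Non-movable requests that may
 * not fall back are promoted to at least a whole pageblock; everything
 * else keeps its original order.
 */
int promoted_order(int order, unsigned int alloc_flags, enum migratetype mt)
{
	if ((alloc_flags & ALLOC_NOFRAGMENT_FLAG) && mt != MT_MOVABLE)
		return order > PAGEBLOCK_ORDER ? order : PAGEBLOCK_ORDER;
	return order;
}

/*
 * Capture still matches on the allocation order (capc->order in the
 * patch), not on the promoted compaction order.
 */
bool capture_order_matches(int freed_order, int alloc_order)
{
	return freed_order == alloc_order;
}
```

Note that an order-0 non-movable request under ALLOC_NOFRAGMENT now gets a non-zero compact_order, so it no longer returns early from __alloc_pages_direct_compact().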
* Re: [PATCH 2/4] mm: compaction: support non-movable compaction for pageblock requests
2026-07-01 15:28 ` Johannes Weiner
@ 2026-07-01 18:14 ` Vlastimil Babka (SUSE)
2026-07-01 21:11 ` Johannes Weiner
0 siblings, 1 reply; 16+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-07-01 18:14 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Zi Yan, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, linux-mm, linux-kernel
On 7/1/26 17:28, Johannes Weiner wrote:
> On Wed, Jul 01, 2026 at 04:19:29PM +0200, Vlastimil Babka (SUSE) wrote:
>> On 6/26/26 20:21, Johannes Weiner wrote:
>> > While trying to fix a reclaim storm in defrag_mode, I noticed that
>> > non-movable direct compaction is extremely inefficient.
>> >
>> > When searching for space to evacuate, compaction only allows blocks of
>> > the same type as the incoming request. This is to prevent migratetype
>> > pollution, where a small non-movable request frees space in a movable
>> > block and provokes the allocator to fall back and pollute it.
>> >
>> > This protection is reasonable on one hand, but the downside is that it
>> > makes non-movable direct compaction nearly useless: if we get the type
>> > annotations right, by definition there aren't any movable pages inside
>> > the non-movable blocks it is allowed to scan.
>> >
>> > With defrag_mode, the goal is the production of whole blocks, which
>> > are essentially type neutral: __rmqueue_claim() will convert them
>> > wholesale on alloc. This makes type mixing and pollution a non-issue.
>> >
>> > Fix the pollution gates to take the requested order into account, and
>> > allow whole-block requests to scan blocks of other types.
>> >
>> > The only exception is CMA blocks. That type is sticky and these blocks
>> > cannot be claimed to other types. Continue to be strict with them, and
>> > allow only explicit ALLOC_CMA requests and kcompactd to evacuate them.
>> >
>> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>>
>> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
>
> Thanks!
>
>> > diff --git a/mm/compaction.c b/mm/compaction.c
>> > index f08765ade014..7df3a85d43af 100644
>> > --- a/mm/compaction.c
>> > +++ b/mm/compaction.c
>> > @@ -1381,12 +1381,33 @@ static bool suitable_migration_source(struct compact_control *cc,
>> > if (pageblock_skip_persistent(page))
>> > return false;
>> >
>> > - if ((cc->mode != MIGRATE_ASYNC) || !cc->direct_compaction)
>> > + /*
>> > + * Background compaction produces blocks for the zone at
>> > + * large, with no particular allocation context. Allow all
>> > + * block types, including CMA.
>> > + */
>> > + if (!cc->direct_compaction)
>> > return true;
>> >
>> > block_mt = get_pageblock_migratetype(page);
>> >
>> > - if (cc->migratetype == MIGRATE_MOVABLE)
>> > + /*
>> > + * CMA pages can only be taken by ALLOC_CMA requests. For anybody
>> > + * else, vacating a CMA block consumes free pages the caller
>> > + * could have used, and produces free pages it cannot.
>> > + */
>> > + if (is_migrate_cma(block_mt) && !(cc->alloc_flags & ALLOC_CMA))
>> > + return false;
>> > +
>> > + if (cc->mode != MIGRATE_ASYNC)
>> > + return true;
>>
>> This now stands out as uncommented. Can we come up with a rationale? :)
>
> Let's see. Originally it came from here:
>
> commit 9927af740b1b9b1e769310bd0b91425e8047b803
> Author: Mel Gorman <mel@csn.ul.ie>
> Date: Thu Jan 13 15:45:59 2011 -0800
>
> mm: compaction: perform a faster migration scan when migrating asynchronously
>
> This limited async scanners to movable blocks. By keeping them to the
> most productive space, it keeps their latencies down.
>
> But then there was a follow up here:
>
> commit 282722b0d258ec23fc79d80165418fee83f01736
> Author: Vlastimil Babka <vbabka@kernel.org>
> Date: Mon May 8 15:54:49 2017 -0700
>
> mm, compaction: restrict async compaction to pageblocks of same migratetype
Aha :)
> This made the migratetype filtering about preventing block
> pollution. The patch quotes reduced extfrag numbers.
>
> So now we have a block pollution guard that we apply only if... the
> scanner is latency sensitive? :) Is this actually desired behavior?
Yeah indeed I was wondering in this direction.
> Another way of looking at it would be this:
>
> /*
> * Allocation fallbacks can spread migratable pages
> * into non-movable blocks.
But also vice versa, non-movable pages into movable blocks (without
defrag_mode)?
> This is high-effort,
> * low-result work. Restrict it to sync scans.
> */
Works for me.
* Re: [PATCH 3/4] mm: page_alloc: move capture_control to the page allocator
2026-07-01 18:02 ` Vlastimil Babka (SUSE)
@ 2026-07-01 20:57 ` Johannes Weiner
0 siblings, 0 replies; 16+ messages in thread
From: Johannes Weiner @ 2026-07-01 20:57 UTC (permalink / raw)
To: Vlastimil Babka (SUSE)
Cc: Andrew Morton, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Zi Yan, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, linux-mm, linux-kernel
On Wed, Jul 01, 2026 at 08:02:55PM +0200, Vlastimil Babka (SUSE) wrote:
> On 6/26/26 20:21, Johannes Weiner wrote:
> > The compaction capturing code assumes the allocation request order
> > and compaction target order are the same. That won't be true once
> > defrag_mode promotes sub-block allocations to pageblock-order
> > compaction: compaction targets the larger order, capture should
> > remain at the original allocation order.
>
> Well I guess you could also try to capture the whole-pageblock page and then
> deal with it as with whole pageblock stealing? But this works too and
> perhaps it's simpler.
You mean capture the whole block in *page and then break off the chunk
we need? That's pretty far upstack. We don't hold the zone->lock at
that point to expand()... I think this is indeed a bit simpler.
> > @@ -2808,35 +2808,22 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
> > .ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
> > .ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
> > };
> > - struct capture_control capc = {
> > - .cc = &cc,
> > - .page = NULL,
> > - };
> >
> > - /*
> > - * Make sure the structs are really initialized before we expose the
> > - * capture control, in case we are interrupted and the interrupt handler
> > - * frees a page.
> > - */
> > + /* See the comment in __alloc_pages_direct_compact() */
> > barrier();
> > - WRITE_ONCE(current->capture_control, &capc);
> > + WRITE_ONCE(capc->cc, &cc);
> >
> > - ret = compact_zone(&cc, &capc);
> > + ret = compact_zone(&cc, capc);
> > +
> > + WRITE_ONCE(capc->cc, NULL);
>
> I wonder if it makes sense to continue having capc->cc and this whole dance
> in two functions. AFAICS (after patch 4/4) we access only capc->cc->zone and
> capc->cc->migratetype. migratetype is stable in the whole
> try_to_compact_pages(), could be part of capc. Order can be added by this
> patch (with no semantic change to it) and not the next one. Zone varies, but
> could be also in capc and set by try_to_compact_pages() before every call to
> compact_zone_order(). Then compact_zone_order() doesn't have to set up any
> capc fields anymore?
You mean something like this?
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index f29ef0653546..66a2f70e9e01 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -58,6 +58,7 @@ enum compact_result {
};
struct alloc_context; /* in mm/internal.h */
+struct capture_control; /* in mm/internal.h */
/*
* Number of free order-0 pages that should be available above given watermark
@@ -92,7 +93,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
unsigned int order, unsigned int alloc_flags,
const struct alloc_context *ac, enum compact_priority prio,
- struct page **page);
+ struct capture_control *capc);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern bool compaction_suitable(struct zone *zone, int order,
unsigned long watermark, int highest_zoneidx);
diff --git a/mm/compaction.c b/mm/compaction.c
index 7df3a85d43af..f37c130bc5cc 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2791,7 +2791,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
static enum compact_result compact_zone_order(struct zone *zone, int order,
gfp_t gfp_mask, enum compact_priority prio,
unsigned int alloc_flags, int highest_zoneidx,
- struct page **capture)
+ struct capture_control *capc)
{
enum compact_result ret;
struct compact_control cc = {
@@ -2808,10 +2808,6 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
};
- struct capture_control capc = {
- .cc = &cc,
- .page = NULL,
- };
/*
* Make sure the structs are really initialized before we expose the
@@ -2819,9 +2815,9 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
* frees a page.
*/
barrier();
- WRITE_ONCE(current->capture_control, &capc);
+ WRITE_ONCE(current->capture_control, capc);
- ret = compact_zone(&cc, &capc);
+ ret = compact_zone(&cc, capc);
/*
* Make sure we hide capture control first before we read the captured
@@ -2829,14 +2825,14 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
* and we would leak it.
*/
WRITE_ONCE(current->capture_control, NULL);
- *capture = READ_ONCE(capc.page);
+
/*
* Technically, it is also possible that compaction is skipped but
* the page is still captured out of luck(IRQ came and freed the page).
* Returning COMPACT_SUCCESS in such cases helps in properly accounting
* the COMPACT[STALL|FAIL] when compaction is skipped.
*/
- if (*capture)
+ if (READ_ONCE(capc->page))
ret = COMPACT_SUCCESS;
return ret;
@@ -2849,13 +2845,13 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
* @alloc_flags: The allocation flags of the current allocation
* @ac: The context of current allocation
* @prio: Determines how hard direct compaction should try to succeed
- * @capture: Pointer to free page created by compaction will be stored here
+ * @capc: Free page capture bypassing the freelist
*
* This is the main entry point for direct page compaction.
*/
enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
unsigned int alloc_flags, const struct alloc_context *ac,
- enum compact_priority prio, struct page **capture)
+ enum compact_priority prio, struct capture_control *capc)
{
struct zoneref *z;
struct zone *zone;
@@ -2882,8 +2878,9 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
continue;
}
+ capc->zone = zone;
status = compact_zone_order(zone, order, gfp_mask, prio,
- alloc_flags, ac->highest_zoneidx, capture);
+ alloc_flags, ac->highest_zoneidx, capc);
rc = max(status, rc);
/* The allocation should succeed, stop compacting */
diff --git a/mm/internal.h b/mm/internal.h
index 181e79f1d6a2..6d3402001b93 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1059,7 +1059,9 @@ struct compact_control {
* immediately when one is created during the free path.
*/
struct capture_control {
- struct compact_control *cc;
+ struct zone *zone;
+ int migratetype;
+ int order;
struct page *page;
};
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cb422505c6ef..a2cceaaaccb1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -721,14 +721,14 @@ static inline struct capture_control *task_capc(struct zone *zone)
return unlikely(capc) &&
!(current->flags & PF_KTHREAD) &&
!capc->page &&
- capc->cc->zone == zone ? capc : NULL;
+ capc->zone == zone ? capc : NULL;
}
static inline bool
compaction_capture(struct capture_control *capc, struct page *page,
int order, int migratetype)
{
- if (!capc || order != capc->cc->order)
+ if (!capc || order != capc->order)
return false;
/* Do not accidentally pollute CMA or isolated regions*/
@@ -744,12 +744,12 @@ compaction_capture(struct capture_control *capc, struct page *page,
* have trouble finding a high-order free page.
*/
if (order < pageblock_order && migratetype == MIGRATE_MOVABLE &&
- capc->cc->migratetype != MIGRATE_MOVABLE)
+ capc->migratetype != MIGRATE_MOVABLE)
return false;
- if (migratetype != capc->cc->migratetype)
- trace_mm_page_alloc_extfrag(page, capc->cc->order, order,
- capc->cc->migratetype, migratetype);
+ if (migratetype != capc->migratetype)
+ trace_mm_page_alloc_extfrag(page, capc->order, order,
+ capc->migratetype, migratetype);
capc->page = page;
return true;
@@ -4146,6 +4146,12 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct page *page = NULL;
unsigned long pflags;
unsigned int noreclaim_flag;
+ struct capture_control capc = {
+ .zone = NULL,
+ .migratetype = ac->migratetype,
+ .order = order,
+ .page = NULL,
+ };
if (!order)
return NULL;
@@ -4156,13 +4162,15 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
noreclaim_flag = memalloc_noreclaim_save();
*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
- prio, &page);
+ prio, &capc);
memalloc_noreclaim_restore(noreclaim_flag);
fs_reclaim_release(gfp_mask);
psi_memstall_leave(&pflags);
delayacct_compact_end();
+ page = READ_ONCE(capc.page);
+
if (*compact_result == COMPACT_SKIPPED ||
*compact_result == COMPACT_DEFERRED)
return NULL;
That could work as well, but let me know if that's what you thought.
The only thing I don't like so much is that it moves the
READ_ONCE(capc.page) out of the function where we have those nice IRQ
preemption comments and the current->capture_control barriers.
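For illustration, here is a minimal userspace sketch of the
publish/hide/read ordering those comments describe. It is not the kernel
code: capture_model, task_capture, fake_free_path() and demo_page are
hypothetical stand-ins, the "interrupt" is an ordinary function call, and
barrier() is the usual GCC/Clang compiler-barrier idiom.

```c
#include <stddef.h>

/* Compiler barrier (GCC/Clang idiom), standing in for the kernel's barrier(). */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical model of a capture request living on the caller's stack. */
struct capture_model {
	int order;
	int *volatile page;	/* NULL until the free path captures a page */
};

/* Hypothetical per-task slot, standing in for current->capture_control. */
static struct capture_model *volatile task_capture;

/* Stands in for a struct page that gets freed while we compact. */
static int demo_page;

/* Models the free path, which may run from an interrupt at any time. */
static int fake_free_path(int *page, int order)
{
	struct capture_model *capc = task_capture;

	if (!capc || capc->page || capc->order != order)
		return 0;
	capc->page = page;
	return 1;
}

static int *compact_and_capture(int want_order, int freed_order, int *freed)
{
	struct capture_model capc = { .order = want_order, .page = NULL };

	/* The struct must be fully initialized before it becomes visible. */
	barrier();
	task_capture = &capc;

	/* "Compaction" runs here; an interrupt frees a page meanwhile. */
	fake_free_path(freed, freed_order);

	/* Hide the request first, then read the result, so that no page
	 * can be captured after we looked and then be leaked. */
	task_capture = NULL;
	return capc.page;
}
```

Calling compact_and_capture(3, 3, &demo_page) returns &demo_page, while a
free of a different order leaves the result NULL. Keeping the final read
next to the hide step is what makes the ordering easy to audit.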
> > @@ -718,7 +718,7 @@ static inline struct capture_control *task_capc(struct zone *zone)
> > {
> > struct capture_control *capc = current->capture_control;
> >
> > - return unlikely(capc) &&
> > + return unlikely(capc && capc->cc) &&
> > !(current->flags & PF_KTHREAD) &&
> > !capc->page &&
> > capc->cc->zone == zone ? capc : NULL;
> > @@ -4146,23 +4146,42 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> > struct page *page = NULL;
> > unsigned long pflags;
> > unsigned int noreclaim_flag;
> > + struct capture_control capc = {
> > + .page = NULL,
>
> You didn't set .cc to NULL explicitly...
>
> > + };
> >
> > if (!order)
> > return NULL;
> >
> > + /*
> > + * Make sure the structs are really initialized before we expose the
> > + * capture control, in case we are interrupted and the interrupt handler
> > + * frees a page.
> > + */
> > + barrier();
>
> So either an implicit { } NULL / zero initialization + barrier() is enough
> (I hope so) and we don't need to set NULL / zero in every field explicitly.
> Or not and then we should set every field and not just page.
That partial capc = { .page = NULL, } will zero the other fields. I don't
see a correctness issue. Or were you talking about readability?
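A small standalone check of that C rule, as a sketch only: capture_model
is a hypothetical stand-in for struct capture_control, not the kernel
definition. Members that a designated initializer does not name are
initialized as if the object had static storage duration, i.e. to zero or
NULL.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct capture_control. */
struct capture_model {
	void *zone;
	int migratetype;
	int order;
	void *page;
};

static bool partial_init_zeroes_rest(void)
{
	/* Only .page is named; the remaining members are implicitly zeroed. */
	struct capture_model capc = {
		.page = NULL,
	};

	return capc.zone == NULL &&
	       capc.migratetype == 0 &&
	       capc.order == 0 &&
	       capc.page == NULL;
}
```

partial_init_zeroes_rest() returns true on any conforming compiler, so
spelling out every field would only be a readability choice.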
Thanks for taking a look!
* Re: [PATCH 4/4] mm: page_alloc: fix non-movable reclaim storm in defrag_mode
2026-07-01 18:06 ` Vlastimil Babka (SUSE)
@ 2026-07-01 21:02 ` Johannes Weiner
0 siblings, 0 replies; 16+ messages in thread
From: Johannes Weiner @ 2026-07-01 21:02 UTC (permalink / raw)
To: Vlastimil Babka (SUSE)
Cc: Andrew Morton, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Zi Yan, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, linux-mm, linux-kernel
On Wed, Jul 01, 2026 at 08:06:03PM +0200, Vlastimil Babka (SUSE) wrote:
> On 6/26/26 20:21, Johannes Weiner wrote:
> > As we deployed defrag_mode into Meta production, pressure spikes and
> > excessive swapping were observed on some workloads. Tracing confirmed
> > that this is unmovable/reclaimable requests spinning in the allocator
> > and direct reclaim, causing excessive amounts of swap.
> >
> > The initial plan for defrag_mode was to rely on kswapd/kcompactd to
> > produce blocks, and if those are overwhelmed under high pressure, let
> > the allocator fall back (__rmqueue_steal()) after its retry loops.
> > However, that retrying results in more reclaim on some of these
> > workloads than we'd hoped, sometimes excessively so, spurred on by the
> > !costly order conditions in should_reclaim_retry().
> >
> > The storms are dependent on the request type. Reclaim will inevitably
> > make room in existing movable blocks, since that's where the LRU pages
> > live. So if movable requests retry on reclaim, they make progress.
> >
> > When non-movable requests spin in reclaim, that isn't productive. They
> > cannot use the individually freed pages, and the process is unlikely
> > to accidentally free whole blocks to meet the ALLOC_NOFRAGMENT bar.
> > They spin and overreclaim excessively, which tanks performance and
> > triggers userspace guards like swap exhaustion or pressure based OOM.
> >
> > To fix this, send non-movable requests, regardless of order, into
> > pageblock reclaim/compaction. This way, they help move things along to
> > meet the ALLOC_NOFRAGMENT bar. After this patch, the reclaim storms
> > and excess OOM rates are no longer observed in production.
> >
> > The longer-term plan is still to have all requests, including the
> > movable ones, help make blocks to spread the cost of defragmenting
> > more evenly and fairly; combined with proper watermarking to reduce
> > allocation latencies in the common case. However, doing this naively
> > unearths scaling and concurrency limitations in compaction that need
> > to be addressed first. Promoting just non-movables for now is the
> > minimally viable bug fix for the above issue.
> >
> > Fixes: f38356df6474 ("mm: page_alloc: introduce defrag_mode")
>
> That's from 6.15. Do you intend any stable backporting, or do we just mark it
> as a heads-up for anyone who tracks fixes and might consider it?
Good point, let's Cc: stable.
I doubt there are many defrag_mode users at this point, but this is
quite the hand grenade that went off in production once already and was
a pain to debug.
* Re: [PATCH 2/4] mm: compaction: support non-movable compaction for pageblock requests
2026-07-01 18:14 ` Vlastimil Babka (SUSE)
@ 2026-07-01 21:11 ` Johannes Weiner
0 siblings, 0 replies; 16+ messages in thread
From: Johannes Weiner @ 2026-07-01 21:11 UTC (permalink / raw)
To: Vlastimil Babka (SUSE)
Cc: Andrew Morton, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Zi Yan, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, linux-mm, linux-kernel
On Wed, Jul 01, 2026 at 08:14:05PM +0200, Vlastimil Babka (SUSE) wrote:
> On 7/1/26 17:28, Johannes Weiner wrote:
> > On Wed, Jul 01, 2026 at 04:19:29PM +0200, Vlastimil Babka (SUSE) wrote:
> >> On 6/26/26 20:21, Johannes Weiner wrote:
> >> > While trying to fix a reclaim storm in defrag_mode, I noticed that
> >> > non-movable direct compaction is extremely inefficient.
> >> >
> >> > When searching for space to evacuate, compaction only allows blocks of
> >> > the same type as the incoming request. This is to prevent migratetype
> >> > pollution, where a small non-movable request frees space in a movable
> >> > block and provokes the allocator to fall back and pollute it.
> >> >
> >> > This protection is reasonable on one hand, but the downside is that it
> >> > makes non-movable direct compaction nearly useless: if we get the type
> >> > annotations right, by definition there aren't any movable pages inside
> >> > the non-movable blocks it is allowed to scan.
> >> >
> >> > With defrag_mode, the goal is the production of whole blocks, which
> >> > are essentially type neutral: __rmqueue_claim() will convert them
> >> > wholesale on alloc. This makes type mixing and pollution a non-issue.
> >> >
> >> > Fix the pollution gates to take the requested order into account, and
> >> > allow whole-block requests to scan blocks of other types.
> >> >
> >> > The only exception is CMA blocks. That type is sticky and these blocks
> >> > cannot be claimed to other types. Continue to be strict with them, and
> >> > allow only explicit ALLOC_CMA requests and kcompactd to evacuate them.
> >> >
> >> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> >>
> >> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> >
> > Thanks!
> >
> >> > diff --git a/mm/compaction.c b/mm/compaction.c
> >> > index f08765ade014..7df3a85d43af 100644
> >> > --- a/mm/compaction.c
> >> > +++ b/mm/compaction.c
> >> > @@ -1381,12 +1381,33 @@ static bool suitable_migration_source(struct compact_control *cc,
> >> > if (pageblock_skip_persistent(page))
> >> > return false;
> >> >
> >> > - if ((cc->mode != MIGRATE_ASYNC) || !cc->direct_compaction)
> >> > + /*
> >> > + * Background compaction produces blocks for the zone at
> >> > + * large, with no particular allocation context. Allow all
> >> > + * block types, including CMA.
> >> > + */
> >> > + if (!cc->direct_compaction)
> >> > return true;
> >> >
> >> > block_mt = get_pageblock_migratetype(page);
> >> >
> >> > - if (cc->migratetype == MIGRATE_MOVABLE)
> >> > + /*
> >> > + * CMA pages can only be taken by ALLOC_CMA requests. For anybody
> >> > + * else, vacating a CMA block consumes free pages the caller
> >> > + * could have used, and produces free pages it cannot.
> >> > + */
> >> > + if (is_migrate_cma(block_mt) && !(cc->alloc_flags & ALLOC_CMA))
> >> > + return false;
> >> > +
> >> > + if (cc->mode != MIGRATE_ASYNC)
> >> > + return true;
> >>
> >> This now stands out as uncommented. Can we come up with a rationale? :)
> >
> > Let's see. Originally it came from here:
> >
> > commit 9927af740b1b9b1e769310bd0b91425e8047b803
> > Author: Mel Gorman <mel@csn.ul.ie>
> > Date: Thu Jan 13 15:45:59 2011 -0800
> >
> > mm: compaction: perform a faster migration scan when migrating asynchronously
> >
> > This limited async scanners to movable blocks. By keeping them to the
> > most productive space, it keeps their latencies down.
> >
> > But then there was a follow up here:
> >
> > commit 282722b0d258ec23fc79d80165418fee83f01736
> > Author: Vlastimil Babka <vbabka@kernel.org>
> > Date: Mon May 8 15:54:49 2017 -0700
> >
> > mm, compaction: restrict async compaction to pageblocks of same migratetype
>
> Aha :)
>
> > This made the migratetype filtering about preventing block
> > pollution. The patch quotes reduced extfrag numbers.
> >
> > So now we have a block pollution guard that we apply only if... the
> > scanner is latency sensitive? :) Is this actually desired behavior?
>
> Yeah indeed I was wondering in this direction.
>
> > Another way of looking at it would be this:
> >
> > /*
> > * Allocation fallbacks can spread migratable pages
> > * into non-movable blocks.
>
> But also vice versa, non-movable pages into movable blocks? (without
> defrag_mode?).
Uhm but those aren't compactable anymore then, right?
There is a flipside, but it isn't quite symmetrical. Movable requests
are allowed to look for movables in unmovable blocks once they become
sync (high-effort, low-result). Non-movable requests are finally
allowed to empty movable blocks once they become sync; they *start*
with the high-effort, low-result mode to avoid block contamination but
are allowed to escalate when that doesn't produce results.
I'll try to work that second part into the comment as well.
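To make the escalation concrete, here is a rough userspace model of the
gates discussed above. It is a sketch under assumptions, not the patch:
mt_model, scan_model and source_block_ok() are hypothetical names, the
async tail mirrors the same-type filtering from the 2017 commit as I
understand it, and the order-based whole-block gate from this patch is
deliberately not modeled.

```c
#include <stdbool.h>

/* Hypothetical, simplified migratetypes. */
enum mt_model { MT_UNMOVABLE, MT_MOVABLE, MT_RECLAIMABLE, MT_CMA };

/* Hypothetical subset of the compaction context. */
struct scan_model {
	bool direct_compaction;	/* false models kcompactd */
	bool async;		/* true models MIGRATE_ASYNC */
	bool alloc_cma;		/* true models ALLOC_CMA */
	enum mt_model migratetype;
};

static bool source_block_ok(const struct scan_model *cc, enum mt_model block_mt)
{
	/* Background compaction has no allocation context: any block. */
	if (!cc->direct_compaction)
		return true;

	/* Only ALLOC_CMA requests can use pages freed in CMA blocks. */
	if (block_mt == MT_CMA && !cc->alloc_cma)
		return false;

	/* Sync scanners have escalated and may evacuate other types. */
	if (!cc->async)
		return true;

	/* Async scanners stay within the request's own type (assumed tail). */
	if (cc->migratetype == MT_MOVABLE)
		return block_mt == MT_MOVABLE || block_mt == MT_CMA;
	return block_mt == cc->migratetype;
}
```

In this model an async unmovable request is refused a movable block, and
the same request is allowed to empty it once it has become sync.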
Thread overview: 16+ messages
2026-06-26 18:21 [PATCH 0/4] mm: fix reclaim storms in defrag_mode Johannes Weiner
2026-06-26 18:21 ` [PATCH 1/4] mm: page_alloc: __GFP_FS lockdep annotation for direct compaction Johannes Weiner
2026-07-01 13:45 ` Vlastimil Babka (SUSE)
2026-06-26 18:21 ` [PATCH 2/4] mm: compaction: support non-movable compaction for pageblock requests Johannes Weiner
2026-07-01 14:19 ` Vlastimil Babka (SUSE)
2026-07-01 15:28 ` Johannes Weiner
2026-07-01 18:14 ` Vlastimil Babka (SUSE)
2026-07-01 21:11 ` Johannes Weiner
2026-06-26 18:21 ` [PATCH 3/4] mm: page_alloc: move capture_control to the page allocator Johannes Weiner
2026-07-01 18:02 ` Vlastimil Babka (SUSE)
2026-07-01 20:57 ` Johannes Weiner
2026-06-26 18:21 ` [PATCH 4/4] mm: page_alloc: fix non-movable reclaim storm in defrag_mode Johannes Weiner
2026-06-26 18:29 ` Zi Yan
2026-06-26 18:43 ` Johannes Weiner
2026-07-01 18:06 ` Vlastimil Babka (SUSE)
2026-07-01 21:02 ` Johannes Weiner