From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
To: Johannes Weiner <hannes@cmpxchg.org>,
Andrew Morton <akpm@linux-foundation.org>
Cc: Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>,
Brendan Jackman <jackmanb@google.com>, Zi Yan <ziy@nvidia.com>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>,
"Liam R. Howlett" <liam@infradead.org>,
Mike Rapoport <rppt@kernel.org>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 4/4] mm: page_alloc: fix non-movable reclaim storm in defrag_mode
Date: Wed, 1 Jul 2026 20:06:03 +0200
Message-ID: <3f013cf5-008a-4207-85ce-d6f7c0296d99@kernel.org>
In-Reply-To: <20260626182215.1107966-5-hannes@cmpxchg.org>

On 6/26/26 20:21, Johannes Weiner wrote:
> As we deployed defrag_mode into Meta production, pressure spikes and
> excessive swapping were observed on some workloads. Tracing confirmed
> that this is unmovable/reclaimable requests spinning in the allocator
> and direct reclaim, causing excessive amounts of swap.
>
> The initial plan for defrag_mode was to rely on kswapd/kcompactd to
> produce blocks, and if those are overwhelmed under high pressure, let
> the allocator fall back (__rmqueue_steal()) after its retry loops.
> However, that retrying results in more reclaim on some of these
> workloads than we'd hoped, sometimes excessively so, spurred on by the
> !costly order conditions in should_reclaim_retry().
>
> The storms are dependent on the request type. Reclaim will inevitably
> make room in existing movable blocks, since that's where the LRU pages
> live. So if movable requests retry on reclaim, they make progress.
>
> When non-movable requests spin in reclaim that isn't productive. They
> cannot use the individually freed pages, and the process is unlikely
> to accidentally free whole blocks to meet the ALLOC_NOFRAGMENT bar.
> They spin and overreclaim excessively, which tanks performance and
> triggers userspace guards like swap exhaustion or pressure based OOM.
>
> To fix this, send non-movable requests, regardless of order, into
> pageblock reclaim/compaction. This way, they help move things along to
> meet the ALLOC_NOFRAGMENT bar. After this patch, the reclaim storms
> and excess OOM rates are no longer observed in production.
>
> The longer-term plan is still to have all requests, including the
> movable ones, help make blocks to spread the cost of defragmenting
> more evenly and fairly; combined with proper watermarking to reduce
> allocation latencies in the common case. However, doing this naively
> unearths scaling and concurrency limitations in compaction that need
> to be addressed first. Promoting just non-movables for now is the
> minimally viable bug fix for the above issue.
>
> Fixes: f38356df6474 ("mm: page_alloc: introduce defrag_mode")

That's from 6.15. Do you intend any stable backporting, or do we just mark
it as a heads-up for anyone who tracks fixes and might consider it?

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

LGTM, but since my suggestion for 3/4 would change it a lot, I'll hold off
on formal tags.

> ---
> mm/internal.h | 7 +++++++
> mm/page_alloc.c | 36 +++++++++++++++++++++++++++++-------
> 2 files changed, 36 insertions(+), 7 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 181e79f1d6a2..1f636cfc859a 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1060,6 +1060,13 @@ struct compact_control {
> */
> struct capture_control {
> struct compact_control *cc;
> + /*
> + * Allocation request order. May differ from the compaction
> + * order: defrag_mode promotes sub-block allocations to
> + * pageblock-order compaction; capture still matches at the
> + * original allocation order so prep_new_page() is consistent.
> + */
> + int order;
> struct page *page;
> };
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9dee1c47e795..575a99a4c723 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -728,7 +728,7 @@ static inline bool
> compaction_capture(struct capture_control *capc, struct page *page,
> int order, int migratetype)
> {
> - if (!capc || order != capc->cc->order)
> + if (!capc || order != capc->order)
> return false;
>
> /* Do not accidentally pollute CMA or isolated regions*/
> @@ -748,7 +748,7 @@ compaction_capture(struct capture_control *capc, struct page *page,
> return false;
>
> if (migratetype != capc->cc->migratetype)
> - trace_mm_page_alloc_extfrag(page, capc->cc->order, order,
> + trace_mm_page_alloc_extfrag(page, capc->order, order,
> capc->cc->migratetype, migratetype);
>
> capc->page = page;
> @@ -4147,10 +4147,27 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> unsigned long pflags;
> unsigned int noreclaim_flag;
> struct capture_control capc = {
> + .order = order,
> .page = NULL,
> };
> + int compact_order = order;
>
> - if (!order)
> + /*
> + * If fallbacks are not permitted (defrag_mode), we either
> + * need to reclaim space in a block of matching type, or clear
> + * out an entire block to allow __rmqueue_claim() to convert.
> + *
> + * Reclaim by itself is primarily freeing space in movable
> + * blocks, since that's where the LRU pages live. So this
> + * works for movable requests, but not for others.
> + *
> + * For those, promote the order to help make blocks, instead
> + * of spinning in reclaim alone unproductively.
> + */
> + if ((alloc_flags & ALLOC_NOFRAGMENT) && ac->migratetype != MIGRATE_MOVABLE)
> + compact_order = max(order, pageblock_order);
> +
> + if (!compact_order)
> return NULL;
>
> /*
> @@ -4166,8 +4183,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> fs_reclaim_acquire(gfp_mask);
> noreclaim_flag = memalloc_noreclaim_save();
>
> - *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> - prio, &capc);
> + *compact_result = try_to_compact_pages(gfp_mask, compact_order,
> + alloc_flags, ac, prio, &capc);
>
> memalloc_noreclaim_restore(noreclaim_flag);
> fs_reclaim_release(gfp_mask);
> @@ -4203,7 +4220,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> struct zone *zone = page_zone(page);
>
> zone->compact_blockskip_flush = false;
> - compaction_defer_reset(zone, order, true);
> + compaction_defer_reset(zone, compact_order, true);
> count_vm_event(COMPACTSUCCESS);
> return page;
> }
> @@ -4443,9 +4460,14 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> struct page *page = NULL;
> unsigned long pflags;
> bool drained = false;
> + int reclaim_order = order;
> +
> + /* Match the slowpath compaction promotion in __alloc_pages_direct_compact */
> + if ((alloc_flags & ALLOC_NOFRAGMENT) && ac->migratetype != MIGRATE_MOVABLE)
> + reclaim_order = max(order, pageblock_order);
>
> psi_memstall_enter(&pflags);
> - *did_some_progress = __perform_reclaim(gfp_mask, order, ac);
> + *did_some_progress = __perform_reclaim(gfp_mask, reclaim_order, ac);
> if (unlikely(!(*did_some_progress)))
> goto out;
>
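
To make the promotion rule in the hunks above concrete, here is a minimal
userspace sketch, not kernel code. All sketch_* names and SKETCH_* constants
are hypothetical stand-ins introduced for illustration, and the pageblock
order of 9 is an assumption (the real value depends on the configuration).
It models two things from the patch: non-movable requests under
ALLOC_NOFRAGMENT get their compaction/reclaim order raised to at least
pageblock order, and capture keeps matching on the original request order.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; the real definitions live in kernel headers. */
#define SKETCH_ALLOC_NOFRAGMENT	0x1u
#define SKETCH_PAGEBLOCK_ORDER	9	/* assumed, configuration dependent */

enum sketch_migratetype {
	SKETCH_UNMOVABLE,
	SKETCH_MOVABLE,
	SKETCH_RECLAIMABLE,
};

/*
 * Order handed to compaction/reclaim for a request. Mirrors the
 * max(order, pageblock_order) promotion for non-movable requests
 * when fallbacks are not permitted.
 */
int sketch_promoted_order(int order, unsigned int alloc_flags,
			  enum sketch_migratetype mt)
{
	if ((alloc_flags & SKETCH_ALLOC_NOFRAGMENT) && mt != SKETCH_MOVABLE)
		return order > SKETCH_PAGEBLOCK_ORDER ?
			order : SKETCH_PAGEBLOCK_ORDER;
	return order;
}

/*
 * Capture compares a freed page's order with the original request
 * order, not with the promoted compaction order.
 */
bool sketch_capture_matches(int freed_order, int request_order)
{
	return freed_order == request_order;
}

/* Small self-check exercising the cases discussed in the patch. */
void sketch_self_check(void)
{
	/* Order-0 unmovable request in defrag_mode: promoted. */
	assert(sketch_promoted_order(0, SKETCH_ALLOC_NOFRAGMENT,
				     SKETCH_UNMOVABLE) == SKETCH_PAGEBLOCK_ORDER);
	/* Movable request: left alone. */
	assert(sketch_promoted_order(0, SKETCH_ALLOC_NOFRAGMENT,
				     SKETCH_MOVABLE) == 0);
	/* Compaction ran at the promoted order, capture still wants order 0. */
	assert(sketch_capture_matches(0, 0));
	assert(!sketch_capture_matches(SKETCH_PAGEBLOCK_ORDER, 0));
}
```

One consequence the sketch makes visible: with promotion, an order-0
non-movable request no longer hits the early "no compaction for order 0"
return, because the check is done on the promoted order.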