From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, david@kernel.org,
willy@infradead.org, surenb@google.com, hannes@cmpxchg.org,
ljs@kernel.org, ziy@nvidia.com, usama.arif@linux.dev,
Rik van Riel <riel@fb.com>, Rik van Riel <riel@surriel.com>
Subject: [RFC PATCH 45/45] mm: page_alloc: enlarge and unify spb_evacuate_for_order
Date: Thu, 30 Apr 2026 16:21:14 -0400
Message-ID: <20260430202233.111010-46-riel@surriel.com>
In-Reply-To: <20260430202233.111010-1-riel@surriel.com>
From: Rik van Riel <riel@fb.com>

The slowpath in __alloc_pages_slowpath calls spb_evacuate_for_order
just before dropping ALLOC_NOFRAGMENT. Each successful evacuation
frees a MOV pageblock inside a tainted SPB, so the retry can satisfy
a non-movable allocation via Pass 2 (claim_whole_block) without
having to drop NOFRAGMENT and let __rmqueue_claim taint a clean SPB.
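
For orientation, a minimal sketch of the call-site ordering described
above (illustrative only; the actual __alloc_pages_slowpath retry loop
in this series has more states than shown here):

	/*
	 * Abridged sketch: evacuation runs while ALLOC_NOFRAGMENT is
	 * still set, so a successful retry can be satisfied through
	 * Pass 2 (claim_whole_block) instead of tainting a clean SPB.
	 */
	if (alloc_flags & ALLOC_NOFRAGMENT) {
		if (spb_evacuate_for_order(zone, order, migratetype))
			goto retry;	/* MOV pageblock freed in a tainted SPB */

		/* Last resort: lets __rmqueue_claim taint a clean SPB. */
		alloc_flags &= ~ALLOC_NOFRAGMENT;
		goto retry;
	}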

Two problems with the existing implementation:

1. Per-call budget too small.

     SPB_CONTIG_MAX_CANDIDATES = 4  (also used by contig allocation)
     per-SPB pageblocks        = 3  (hard-coded literal in the evac calls)
     → up to 12 pageblocks scanned/migrated per call

   Production traces show that 13 of 21 tainted SPBs typically have
   MOV content (~2500 pageblocks ≈ 5 GiB of MOV-holding pageblocks
   across the tainted pool). The 4-candidate cap leaves the bulk of
   that evacuatable capacity untouched on each call, so the slowpath
   frequently gives up and drops NOFRAGMENT even though plenty of
   MOV content was available to free.

2. Source-pageblock migratetype filter creates blind spots.

   evacuate_pb_range(..., migratetype, ...) skipped any pageblock
   whose underlying tag did not match the @migratetype argument.
   Phase 1 used the requesting type; Phase 2 used MIGRATE_MOVABLE.
   But MOV content can live in a pageblock of any tag:

   - PASS_2C / PASS_2D borrows set PB_has_<requesting_mt> on a
     MOV-tagged PB without changing the tag, and the borrowed
     pages return to the MOV free list when freed.

   - __spb_set_has_type adds a non-MOV bit to a PB without
     re-evaluating the PB tag, so PBs accumulate has-bits over
     time.

   Result: MOV stragglers in PBs whose tag matches neither phase's
   filter are permanently invisible to evacuation.
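
To make the blind spot concrete, here is a small standalone model of
the two filters (all names and PB states below are illustrative
stand-ins, not code from this series). Running it shows that a
RECLAIMABLE-tagged PB holding MOV stragglers is invisible to both
phases when the requesting type is UNMOVABLE, yet caught by the
unified filter:

	#include <stdbool.h>
	#include <stdio.h>

	enum mt { MOVABLE, UNMOVABLE, RECLAIMABLE };

	struct pb {
		enum mt tag;		/* pageblock migratetype tag */
		bool has_movable;	/* PB_has_movable content bit */
		const char *desc;
	};

	int main(void)
	{
		enum mt req = UNMOVABLE;	/* requesting type this pass */
		struct pb pbs[] = {
			{ UNMOVABLE,   true, "UNMOV tag, MOV stragglers" },
			{ MOVABLE,     true, "MOV tag, pure MOV" },
			{ RECLAIMABLE, true, "RECL tag, MOV stragglers" },
		};

		for (int i = 0; i < 3; i++) {
			/* Phase 1: PB tag must match the requesting type. */
			int p1 = pbs[i].tag == req && pbs[i].has_movable;
			/* Phase 2: PB tag must be MIGRATE_MOVABLE. */
			int p2 = pbs[i].tag == MOVABLE && pbs[i].has_movable;
			/* Unified: content bit only (special tags aside). */
			int uni = pbs[i].has_movable;

			printf("%-26s phase1=%d phase2=%d unified=%d\n",
			       pbs[i].desc, p1, p2, uni);
		}
		return 0;
	}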

Fix both:

* Introduce dedicated SPB_EVACUATE_MAX_CANDIDATES = 16 and
  SPB_EVACUATE_MAX_PB_PER_SB = 8 so the evacuation budget can be
  sized independently of the contig-allocation candidate cap.
  Combined cap: 128 pageblocks (256 MiB) per call; see the worked
  arithmetic after this list.

* Drop the migratetype tag filter from evacuate_pb_range. Accept
  any pageblock with PB_has_movable set, skipping only the cases
  whose semantics forbid touching them here (ISOLATE, CMA,
  HIGHATOMIC).

* Collapse the two-phase structure in spb_evacuate_for_order into
  a single pass. The two phases performed the same evacuation
  action with different filters; once the filter is relaxed, the
  distinction collapses naturally:

  - PBs that are pure MOV become empty → a free MOV pageblock,
    claimable by Pass 2 / claim_whole_block on the retry.

  - PBs that are mixed lose their MOV stragglers, so future
    allocations of the dominant type can use the PB without
    competing with MOV residue.

* sb_collect_evacuate_candidates loses its migratetype parameter:
  after the unified pass, the only candidate filter is
  nr_movable > 0.
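
Worked budget arithmetic for the old vs. new caps (a standalone
illustration assuming 2 MiB pageblocks, the x86-64 default; not code
from the patch):

	#include <stdio.h>

	int main(void)
	{
		const long pb_mib = 2;		/* assumed pageblock size in MiB */
		const long old_cap = 4 * 3;	/* SPB_CONTIG_MAX_CANDIDATES * literal 3 */
		const long new_cap = 16 * 8;	/* MAX_CANDIDATES * MAX_PB_PER_SB */

		/* old:  12 pageblocks =  24 MiB per call */
		/* new: 128 pageblocks = 256 MiB per call */
		printf("old: %3ld pageblocks = %3ld MiB per call\n",
		       old_cap, old_cap * pb_mib);
		printf("new: %3ld pageblocks = %3ld MiB per call\n",
		       new_cap, new_cap * pb_mib);
		return 0;
	}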

The contig-allocation path (spb_try_alloc_contig) is unchanged;
SPB_CONTIG_MAX_CANDIDATES remains at 4. 1 GiB allocations have a
different latency profile, and broad evacuation-style scanning is
not appropriate there.

The trace_spb_evacuate_for_order_done signature is preserved for
ABI continuity with existing observers; the merged attempt count
is reported in phase1_attempts, and phase2_attempts is reported
as 0.

Stack impact: sb_pfns[] grows from 32 bytes to 128 bytes
(16 entries * 8 bytes on 64-bit), trivial for an 8K/16K kernel
stack.

Rik's stated priority hierarchy is:

  P1 protect clean SPBs from being tainted (highest)
  P2 protect MOV pageblocks inside tainted SPBs
  P3 allocation latency (lowest)

so trading a few hundred ms of evacuation latency to keep clean
SPBs clean is the desired direction.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
mm/page_alloc.c | 141 ++++++++++++++++++++++++++++--------------------
1 file changed, 84 insertions(+), 57 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 815cee325ec0..9cc8b9bbc1fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -10603,6 +10603,27 @@ static bool zone_spans_last_pfn(const struct zone *zone,
*/
#define SPB_CONTIG_MAX_CANDIDATES 4
+/*
+ * Maximum tainted superpageblock candidates per spb_evacuate_for_order call.
+ * Collected under zone->lock, then evacuated without it. Larger than the
+ * contig-allocation candidate cap because evacuation runs from the slowpath
+ * after reclaim/compaction failed: we need a meaningful chance of freeing a
+ * non-MOV-claimable pageblock before the slowpath escalates to dropping
+ * ALLOC_NOFRAGMENT (which lets __rmqueue_claim taint clean SPBs). Sized to
+ * scan a sizable fraction of a typical tainted-pool population.
+ */
+#define SPB_EVACUATE_MAX_CANDIDATES 16
+
+/*
+ * Maximum pageblocks to evacuate per candidate SPB inside
+ * spb_evacuate_for_order. Each evacuation triggers page migration which is
+ * O(pages_per_pageblock) wall-clock cost, so this caps per-call latency.
+ * Bumped from 3 to 8 to free more capacity per slowpath escalation pass.
+ * Combined cap: SPB_EVACUATE_MAX_CANDIDATES * SPB_EVACUATE_MAX_PB_PER_SB
+ * pageblocks per call (16 * 8 = 128 PBs = 256 MiB migration budget on x86).
+ */
+#define SPB_EVACUATE_MAX_PB_PER_SB 8
+
#ifdef CONFIG_COMPACTION
/**
* sb_collect_contig_candidates - Find superpageblock ranges for contiguous alloc
@@ -10778,7 +10799,7 @@ static struct page *spb_try_alloc_contig(struct zone *zone,
*
* Returns number of candidate superpageblock PFNs found.
*/
-static int sb_collect_evacuate_candidates(struct zone *zone, int migratetype,
+static int sb_collect_evacuate_candidates(struct zone *zone,
unsigned long *sb_pfns, int max)
{
struct superpageblock *sb;
@@ -10792,20 +10813,6 @@ static int sb_collect_evacuate_candidates(struct zone *zone, int migratetype,
if (!sb->nr_movable)
continue;
- if (migratetype >= 0) {
- bool has_matching;
-
- if (migratetype == MIGRATE_UNMOVABLE)
- has_matching = sb->nr_unmovable > 0;
- else if (migratetype == MIGRATE_RECLAIMABLE)
- has_matching = sb->nr_reclaimable > 0;
- else
- continue;
-
- if (!has_matching)
- continue;
- }
-
sb_pfns[n++] = sb->start_pfn;
if (n >= max)
return n;
@@ -10815,17 +10822,38 @@ static int sb_collect_evacuate_candidates(struct zone *zone, int migratetype,
}
/*
- * Evacuate pageblocks of the given migratetype within a range.
+ * Evacuate MOV content out of any pageblock in the given range that has it.
+ *
+ * The previous version filtered on the source pageblock's migratetype tag,
+ * which made evacuation blind to MOV stragglers living in PBs whose tag did
+ * not match the current allocation's requesting type:
+ *
+ * - PASS_2C / PASS_2D borrows set PB_has_<requesting_mt> on a MOV-tagged
+ * PB without changing the tag. The borrowed pages return to the MOV
+ * free list when freed, so a MOV-tagged PB can host non-MOV PB_has bits
+ * and MOV content simultaneously.
+ *
+ * - When __spb_set_has_type adds a non-MOV bit on a PB, the PB tag is not
+ * re-evaluated. PBs accumulate has-bits over time without their tag
+ * necessarily reflecting current content.
+ *
+ * Drop the migratetype tag filter and accept any PB with PB_has_movable set.
+ * Skip only the cases whose semantics forbid touching them here:
+ * - MIGRATE_ISOLATE: under quarantine
+ * - CMA: owned by its own allocator
+ * - MIGRATE_HIGHATOMIC: reserve; evacuation would race the reservation logic
+ *
* Returns number of pageblocks evacuated.
*/
static int evacuate_pb_range(struct zone *zone, unsigned long start_pfn,
- unsigned long end_pfn, int migratetype, int max)
+ unsigned long end_pfn, int max)
{
unsigned long pfn;
int nr_evacuated = 0;
for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
struct page *page;
+ int pb_mt;
if (!pfn_valid(pfn))
continue;
@@ -10835,10 +10863,13 @@ static int evacuate_pb_range(struct zone *zone, unsigned long start_pfn,
page = pfn_to_page(pfn);
- if (get_pfnblock_migratetype(page, pfn) != migratetype)
+ if (!get_pfnblock_bit(page, pfn, PB_has_movable))
continue;
- if (!get_pfnblock_bit(page, pfn, PB_has_movable))
+ pb_mt = get_pfnblock_migratetype(page, pfn);
+ if (pb_mt == MIGRATE_ISOLATE ||
+ is_migrate_cma(pb_mt) ||
+ pb_mt == MIGRATE_HIGHATOMIC)
continue;
evacuate_pageblock(zone, pfn, true);
@@ -10870,41 +10901,33 @@ static int evacuate_pb_range(struct zone *zone, unsigned long start_pfn,
static bool spb_evacuate_for_order(struct zone *zone, unsigned int order,
int migratetype)
{
- unsigned long sb_pfns[SPB_CONTIG_MAX_CANDIDATES];
+ unsigned long sb_pfns[SPB_EVACUATE_MAX_CANDIDATES];
unsigned long flags;
int nr_sbs, i;
- unsigned int phase1_attempts = 0, phase2_attempts = 0;
+ unsigned int attempts = 0;
bool did_evacuate = false;
- /* Phase 1: coalesce within existing non-movable pageblocks */
- spin_lock_irqsave(&zone->lock, flags);
- nr_sbs = sb_collect_evacuate_candidates(zone, migratetype,
- sb_pfns,
- SPB_CONTIG_MAX_CANDIDATES);
- spin_unlock_irqrestore(&zone->lock, flags);
-
- for (i = 0; i < nr_sbs; i++) {
- unsigned long end_pfn = sb_pfns[i] + SUPERPAGEBLOCK_NR_PAGES;
- int n;
-
- n = evacuate_pb_range(zone, sb_pfns[i], end_pfn,
- migratetype, 3);
- phase1_attempts += n;
- if (n)
- did_evacuate = true;
- }
-
- if (did_evacuate) {
- trace_spb_evacuate_for_order_done(zone, order, migratetype,
- phase1_attempts, phase2_attempts, true);
- return true;
- }
-
- /* Phase 2: evacuate MOVABLE pageblocks to create free whole pageblocks */
+ /*
+ * Single-pass evacuation: collect candidate tainted SPBs (anything
+ * with MOV content), then walk each one's pageblocks evacuating MOV
+ * content from any non-special PB. evacuate_pb_range filters by
+ * PB_has_movable, so this is a no-op on PBs that have no MOV content.
+ *
+ * Two effects accumulate:
+ * - PBs that are pure MOV become empty -> free MOV pageblock,
+ * claimable by Pass 2 / claim_whole_block on the retry.
+ * - PBs that are mixed (e.g., UNMOV + MOV stragglers) lose the MOV
+ * stragglers, so future allocations of the dominant type can use
+ * the PB without competing with the MOV residue.
+ *
+ * The previous two-phase design tried to do these separately and
+ * filtered evacuation by source PB tag. That left MOV content
+ * stranded in PBs whose tag did not match either phase, and gave up
+ * after one phase even though the other phase could have helped.
+ */
spin_lock_irqsave(&zone->lock, flags);
- nr_sbs = sb_collect_evacuate_candidates(zone, -1,
- sb_pfns,
- SPB_CONTIG_MAX_CANDIDATES);
+ nr_sbs = sb_collect_evacuate_candidates(zone, sb_pfns,
+ SPB_EVACUATE_MAX_CANDIDATES);
spin_unlock_irqrestore(&zone->lock, flags);
for (i = 0; i < nr_sbs; i++) {
@@ -10912,24 +10935,28 @@ static bool spb_evacuate_for_order(struct zone *zone, unsigned int order,
int n;
n = evacuate_pb_range(zone, sb_pfns[i], end_pfn,
- MIGRATE_MOVABLE, 3);
- phase2_attempts += n;
+ SPB_EVACUATE_MAX_PB_PER_SB);
+ attempts += n;
if (n)
did_evacuate = true;
}
/*
* Always kick a slab shrink after an evacuation pass — even when
- * movable evacuation succeeded. Slab content stranded inside
- * tainted SPBs can only be freed by shrinking the cache; doing
- * it now keeps headroom available for the next burst, when the
- * movable supply may have run out and movable evac alone would
- * have nothing to do.
+ * MOV evacuation succeeded. Slab content stranded inside tainted
+ * SPBs can only be freed by shrinking the cache; doing it now keeps
+ * headroom available for the next burst, when the MOV supply may
+ * have run out and evac alone would have nothing to do.
*/
queue_spb_slab_shrink(zone);
+ /*
+ * The tracepoint signature retains phase1_attempts / phase2_attempts
+ * for ABI continuity with existing observers; report the merged total
+ * in phase1_attempts and 0 in phase2_attempts.
+ */
trace_spb_evacuate_for_order_done(zone, order, migratetype,
- phase1_attempts, phase2_attempts, did_evacuate);
+ attempts, 0, did_evacuate);
return did_evacuate;
}
#endif /* CONFIG_COMPACTION */
--
2.52.0