From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, david@kernel.org,
willy@infradead.org, surenb@google.com, hannes@cmpxchg.org,
ljs@kernel.org, ziy@nvidia.com, usama.arif@linux.dev,
Rik van Riel <riel@meta.com>, Rik van Riel <riel@surriel.com>
Subject: [RFC PATCH 22/45] mm: page_alloc: adopt partial pageblocks from tainted superpageblocks
Date: Thu, 30 Apr 2026 16:20:51 -0400
Message-ID: <20260430202233.111010-23-riel@surriel.com>
In-Reply-To: <20260430202233.111010-1-riel@surriel.com>
From: Rik van Riel <riel@meta.com>
Add Phase 2 to rmqueue_bulk: when refilling PCP for unmovable or
reclaimable allocations, search tainted superpageblocks for partially-free
pageblocks with sub-pageblock buddy entries of the requested migratetype.
Claim ownership of the pageblock and move the found entry to PCP with
PCPBuddy marking. Phase 0 (the existing owned-block recovery phase)
picks up remaining buddy entries on subsequent refills, so there is no
need to sweep the entire pageblock eagerly.
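As an illustration, the order-descending search that Phase 2 performs inside
a candidate pageblock can be modeled in plain userspace C. This is a sketch
under assumptions: the array, the PAGEBLOCK_ORDER constant, and the function
name are invented for the model; the real code walks struct free_area lists
under the zone lock.

```c
#include <assert.h>

#define PAGEBLOCK_ORDER 9	/* assumed: order-9 pageblocks (2MB with 4K pages) */

/*
 * free_count[o] models "free_area[o] holds an entry of the wanted
 * migratetype". Scan from just below pageblock_order down to the
 * requested order and report the largest free entry, mirroring the
 * Phase 2 loop that prefers the biggest sub-pageblock chunk.
 */
static int find_subblock_order(const int free_count[PAGEBLOCK_ORDER],
			       int req_order)
{
	for (int o = PAGEBLOCK_ORDER - 1; o >= req_order; o--)
		if (free_count[o] > 0)
			return o;	/* largest usable entry wins */
	return -1;			/* nothing usable in this pageblock */
}
```

As in the patch, a pageblock with no entry at or above the requested order is
simply skipped and the scan moves on to the next candidate.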
This concentrates non-movable allocations into already-tainted
superpageblocks, reducing fragmentation spread to clean superpageblocks.
Before claiming ownership, verify the pageblock is not already owned by
another CPU (pbd->cpu == 0). Without this check, two CPUs could have
PCPBuddy pages from the same pageblock on separate PCP lists protected
by different locks, and the PCP merge pass could corrupt the other
CPU's list.
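A minimal userspace model of that ownership check follows. The struct layout,
the function name, and the "owner CPU + 1" encoding are assumptions for the
sketch; the only detail taken from the patch is that pbd->cpu == 0 means
unowned. The real check runs under the zone lock, so no atomics are modeled
here.

```c
#include <assert.h>

/* Toy stand-in for the per-pageblock metadata; not the kernel struct.
 * Assumed encoding: 0 = unowned, otherwise owner CPU number + 1. */
struct pageblock_data {
	unsigned int cpu;
};

/*
 * Mirror of the Phase 2 guard: refuse to adopt a pageblock that is
 * already on some other CPU's PCP. Without this, two CPUs could hold
 * PCPBuddy pages from the same pageblock under different PCP locks.
 */
static int try_claim_pageblock(struct pageblock_data *pbd,
			       unsigned int this_cpu)
{
	if (pbd->cpu != 0)
		return 0;		/* owned elsewhere: skip it */
	pbd->cpu = this_cpu + 1;	/* claim for this CPU */
	return 1;
}
```

A second CPU that finds a free chunk in the same pageblock fails the claim
and continues scanning, which is exactly the skip behavior described above.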
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_alloc.c | 114 ++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 101 insertions(+), 13 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8f925b5a2e5f..4f8105b89e47 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1130,7 +1130,7 @@ static inline void set_buddy_order(struct page *page, unsigned int order)
* - Set when Phase 0/1 restore or acquire whole pageblocks.
* - Propagated to split remainders in pcp_rmqueue_smallest().
* - Set on freed pages from owned blocks routed to the owner PCP.
- * - NOT set for Phase 2/3 fragments or zone-owned frees.
+ * - NOT set for Phase 3 fragments or zone-owned frees.
* - The merge pass in free_pcppages_bulk() only processes
* PagePCPBuddy pages, ensuring it never touches pages on
* another CPU's PCP list.
@@ -3840,15 +3840,15 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
* under a single hold of the lock, for efficiency. Add them to the
* freelist of @pcp.
*
- * When @pcp is non-NULL and @count > 1 (normal pageset), uses a four-phase
+ * When @pcp is non-NULL and @count > 1 (normal pageset), uses a multi-phase
* approach:
- * Phase 0: Recover previously owned, partially drained blocks.
- * Phase 1: Acquire whole pageblocks, claim ownership, set PagePCPBuddy.
- * These pages are eligible for PCP-level buddy merging.
- * Phase 2: Grab sub-pageblock fragments of the same migratetype.
- * Phase 3: Fall back to __rmqueue() with migratetype fallback.
- * Phase 2/3 pages are cached for batching only -- no ownership claim,
- * no PagePCPBuddy, no PCP-level merging.
+ * Phase 0: Recover previously owned, partially drained blocks.
+ * Phase 1: Acquire whole pageblocks, claim ownership, set PagePCPBuddy.
+ * These pages are eligible for PCP-level buddy merging.
+ * Phase 2: Adopt partial pageblocks from tainted SPBs (non-movable only).
+ * Claims ownership so Phase 0 can recover buddy entries later.
+ * Phase 3: Fall back to __rmqueue() with migratetype fallback.
+ * No ownership claim, no PagePCPBuddy, no PCP-level merging.
*
* When @pcp is NULL or @count <= 1 (boot pageset), acquires individual
* pages of the requested order directly.
@@ -3976,11 +3976,99 @@ static bool rmqueue_bulk(struct zone *zone, unsigned int order,
goto out;
/*
- * Phase 2 was removed: it swept zone free lists for sub-pageblock
- * fragments, which are always empty when superpageblocks are enabled.
- * Phase 3's __rmqueue() -> __rmqueue_smallest() properly searches
- * per-superpageblock free lists at all orders.
+ * Phase 2: Adopt partial pageblocks from tainted SPBs.
+ *
+ * Phase 1 only grabs whole free pageblocks. When a tainted SPB
+ * has partially-used pageblocks with free sub-pageblock buddy
+ * entries, Phase 1 can't use them. Phase 3 can find them via
+ * __rmqueue_smallest, but without ownership or PCPBuddy marking,
+ * so they fragment further on drain.
+ *
+ * This phase bridges the gap: find a sub-pageblock free entry
+ * in a tainted SPB and claim ownership of its pageblock. Phase 0
+ * will pick up remaining buddy entries on subsequent refills.
+ *
+ * Unmovable/reclaimable only: movable allocations should use clean SPBs.
*/
+ if (migratetype != MIGRATE_MOVABLE &&
+ !is_migrate_cma(migratetype)) {
+ enum sb_fullness full;
+
+ for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
+ struct superpageblock *sb;
+
+ list_for_each_entry(sb,
+ &zone->spb_lists[SB_TAINTED][full], list) {
+ struct page *page;
+ int found_order = -1;
+
+ if (sb->nr_free_pages < pageblock_nr_pages / 4)
+ continue;
+
+ /*
+ * Find a sub-pageblock free entry for our
+ * migratetype, starting from the largest order.
+ */
+ for (o = pageblock_order - 1; o >= order; o--) {
+ struct free_area *area;
+
+ area = &sb->free_area[o];
+ page = get_page_from_free_area(
+ area, migratetype);
+ if (page) {
+ found_order = o;
+ break;
+ }
+ }
+ if (found_order < 0)
+ continue;
+
+ /*
+ * Check that this pageblock isn't already
+ * owned by another CPU. If it is, two CPUs
+ * would have PCPBuddy pages from the same
+ * pageblock, and the PCP merge pass could
+ * corrupt the other CPU's PCP list.
+ */
+ pbd = pfn_to_pageblock(page,
+ page_to_pfn(page));
+ if (pbd->cpu != 0)
+ continue;
+
+ /*
+ * Found a free chunk in an unowned pageblock.
+ * Take it from buddy, claim ownership, and
+ * set PCPBuddy. Phase 0 will grab remaining
+ * buddy entries on future refills.
+ *
+ * Set PB_has_<migratetype> since we bypass
+ * page_del_and_expand (which normally does
+ * PB_has tracking).
+ */
+ del_page_from_free_list(page, zone,
+ found_order,
+ migratetype);
+ __spb_set_has_type(page, migratetype);
+ set_pcpblock_owner(page, cpu);
+ __SetPagePCPBuddy(page);
+ pcp_enqueue_tail(pcp, page, migratetype,
+ found_order);
+ refilled += 1 << found_order;
+
+ /*
+ * Register for Phase 0 recovery so future
+ * drains from this pageblock can be swept
+ * back efficiently.
+ */
+ if (list_empty(&pbd->cpu_node))
+ list_add(&pbd->cpu_node,
+ &pcp->owned_blocks);
+
+ if (refilled >= pages_needed)
+ goto out;
+ }
+ }
+ }
/*
* Phase 3: Last resort. Use __rmqueue() which does
--
2.52.0